Browser Automation
Many modern websites rely heavily on JavaScript to render content, making traditional HTTP requests insufficient for scraping. qCrawl supports browser automation through the Camoufox downloader, a stealth browser based on Firefox that can bypass anti-bot detection systems.
Installation
Install qCrawl with Camoufox support:
pip install qcrawl[camoufox]
Basic Usage
Enable the Camoufox downloader by configuring it in your spider's custom_settings:
from qcrawl.core.spider import Spider
from qcrawl.core.request import Request
from qcrawl.core.page import PageMethod


class BrowserSpider(Spider):
    name = "browser_spider"
    start_urls = ["https://quotes.toscrape.com/js/"]

    custom_settings = {
        # Register Camoufox downloader
        "DOWNLOAD_HANDLERS": {
            "http": "qcrawl.downloaders.HTTPDownloader",
            "https": "qcrawl.downloaders.HTTPDownloader",
            "camoufox": "qcrawl.downloaders.CamoufoxDownloader",
        },
        # Configure browser contexts
        "CAMOUFOX_CONTEXTS": {
            "default": {
                "viewport": {"width": 1280, "height": 720},
            }
        },
        # Browser pool settings
        "CAMOUFOX_MAX_CONTEXTS": 2,
        "CAMOUFOX_MAX_PAGES_PER_CONTEXT": 3,
        "CAMOUFOX_LAUNCH_OPTIONS": {"headless": True},
    }

    async def start_requests(self):
        for url in self.start_urls:
            yield Request(
                url=url,
                meta={
                    "use_handler": "camoufox",
                    "camoufox_page_methods": [
                        PageMethod("wait_for_selector", ".quote")
                    ]
                }
            )

    async def parse(self, response):
        # Parse JavaScript-rendered HTML
        rv = self.response_view(response)
        quotes = rv.doc.cssselect("div.quote")
        for q in quotes:
            text = q.cssselect("span.text")[0].text_content().strip()
            author = q.cssselect("small.author")[0].text_content().strip()
            yield {"text": text, "author": author}
Configuration Settings
Browser Contexts
Contexts define different browser environments (viewport, user agent, etc.). You can create multiple contexts for different use cases:
"CAMOUFOX_CONTEXTS": {
    "default": {
        "viewport": {"width": 1280, "height": 720},
        "ignore_https_errors": False,
    },
    "mobile": {
        "viewport": {"width": 375, "height": 667},
        "user_agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X)",
    },
    "tablet": {
        "viewport": {"width": 768, "height": 1024},
    }
}
Specify which context to use in the request meta:
yield Request(
    url="https://example.com",
    meta={
        "use_handler": "camoufox",
        "camoufox_context": "mobile"  # Use mobile context
    }
)
Browser Pool Settings
Control the number of browser instances and concurrent pages:
"CAMOUFOX_MAX_CONTEXTS": 10, # Maximum browser contexts
"CAMOUFOX_MAX_PAGES_PER_CONTEXT": 5, # Max pages per context
"CAMOUFOX_DEFAULT_NAVIGATION_TIMEOUT": 30000.0, # Timeout in milliseconds
Launch Options
Configure browser launch parameters:
"CAMOUFOX_LAUNCH_OPTIONS": {
    "headless": True,  # Run without GUI
    "args": [
        "--no-sandbox",
        "--disable-dev-shm-usage",
    ],
}
Remote Browser Connection
Connect to a remote browser via Chrome DevTools Protocol (CDP):
"CAMOUFOX_CDP_URL": "http://localhost:9222", # Remote browser URL
Page Interactions
Execute page methods before or after navigation using PageMethod objects:
from qcrawl.core.page import PageMethod

yield Request(
    url="https://example.com",
    meta={
        "use_handler": "camoufox",
        "camoufox_page_methods": [
            # Wait for selector after page loads
            PageMethod("wait_for_selector", ".content"),
            # Click a button
            PageMethod("click", "#load-more"),
            # Wait for content to load
            PageMethod("wait_for_timeout", 2000),
            # Execute JavaScript
            PageMethod("evaluate", "window.scrollTo(0, document.body.scrollHeight)"),
            # Take a screenshot with keyword arguments
            PageMethod("screenshot", path="/tmp/page.png", full_page=True),
        ]
    }
)
Timing control:
By default, methods execute after navigation. Use the timing parameter for before-navigation execution:
# Execute JavaScript before navigating to the page
PageMethod("evaluate", "navigator.webdriver = false", timing="before")
# Wait for element after navigation (default)
PageMethod("wait_for_selector", ".content") # timing="after" is default
Complex Custom Interactions
For complex custom page interactions, use camoufox_include_page=True to get direct access to the page object in your parse method. See the Manual Page Control section below.
Accessing results:
Page method results are stored and accessible in the parse method:
async def parse(self, response):
    # Access page methods with their results
    methods = response.meta.get("camoufox_page_methods", [])
    # Get screenshot result (if screenshot method was used)
    for method in methods:
        if method.method == "screenshot":
            print(f"Screenshot saved, result: {method.result}")
Common page methods:
- wait_for_selector(selector): Wait for element to appear
- wait_for_timeout(ms): Wait for specified milliseconds
- click(selector): Click an element
- fill(selector, text): Fill an input field
- evaluate(js_code): Execute JavaScript in the page context
- screenshot(path, full_page=True): Take a screenshot
Dict format (backward compatible):
For config files, you can still use dict format:
{
    "method": "wait_for_selector",
    "args": [".content"],
    "timing": "after"
}
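To make the dict format and the timing default concrete, here is a minimal sketch that fills in defaults and splits a list of dict-format methods into before-navigation and after-navigation groups. The helpers `normalize_page_method` and `split_by_timing` are hypothetical and not part of qCrawl, and the `"kwargs"` key is an assumption; only `"method"`, `"args"` and `"timing"` appear in the format above.

```python
from typing import Any


def normalize_page_method(spec: dict[str, Any]) -> dict[str, Any]:
    """Fill in defaults for a dict-format page method (hypothetical helper)."""
    if "method" not in spec:
        raise ValueError("page method dict needs a 'method' key")
    timing = spec.get("timing", "after")  # "after" is the documented default
    if timing not in ("before", "after"):
        raise ValueError(f"unknown timing: {timing!r}")
    return {
        "method": spec["method"],
        "args": list(spec.get("args", [])),
        "kwargs": dict(spec.get("kwargs", {})),  # assumed key, see lead-in
        "timing": timing,
    }


def split_by_timing(specs):
    """Partition specs into (before, after) lists, preserving their order."""
    normalized = [normalize_page_method(s) for s in specs]
    before = [s for s in normalized if s["timing"] == "before"]
    after = [s for s in normalized if s["timing"] == "after"]
    return before, after


before, after = split_by_timing([
    {"method": "evaluate", "args": ["navigator.webdriver = false"], "timing": "before"},
    {"method": "wait_for_selector", "args": [".content"]},
])
print([s["method"] for s in before])  # ['evaluate']
print([s["method"] for s in after])   # ['wait_for_selector']
```

Validating the list this way before building requests surfaces a misspelled timing value at configuration time rather than during a crawl.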
Manual Page Control
For complex interactions, request the page object and control it manually:
from qcrawl.core.spider import Spider
from qcrawl.core.request import Request


class InteractiveSpider(Spider):
    async def start_requests(self):
        yield Request(
            url="https://example.com",
            meta={
                "use_handler": "camoufox",
                "camoufox_include_page": True  # Keep page object alive
            }
        )

    async def parse(self, response):
        if "camoufox_page" in response.meta:
            page = response.meta["camoufox_page"]
            # Perform custom interactions
            await page.click("#button")
            await page.wait_for_selector(".results")
            content = await page.content()
            # IMPORTANT: Close the page when done
            await page.close()

        # Continue parsing
        rv = self.response_view(response)
        # ...
Resource Management
When using camoufox_include_page, you must manually close the page with await page.close() to avoid resource leaks.
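One way to make sure the page is closed even when an interaction fails is to wrap the interaction in try/finally. The sketch below demonstrates the pattern with a stand-in `FakePage` class (hypothetical, used only so the example runs without a browser); in a spider, `page` would be the object taken from `response.meta["camoufox_page"]`.

```python
import asyncio


class FakePage:
    """Stand-in for the browser page; it only tracks whether close() ran."""

    def __init__(self):
        self.closed = False

    async def click(self, selector):
        # Simulate an interaction that fails, e.g. a missing element
        raise RuntimeError(f"element not found: {selector}")

    async def close(self):
        self.closed = True


async def interact(page):
    try:
        await page.click("#button")  # may raise
    finally:
        await page.close()  # runs on success and on failure


page = FakePage()
try:
    asyncio.run(interact(page))
except RuntimeError:
    pass  # the error still propagates; the page is closed regardless
print(page.closed)  # True
```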
Request Header Processing
Control how qCrawl headers are sent to the browser:
"CAMOUFOX_PROCESS_REQUEST_HEADERS": "use_qcrawl_headers" # Default
Options:
- "use_qcrawl_headers": Merge qCrawl headers with browser requests (default)
- "ignore": Don't send qCrawl headers (use browser defaults only)
- Custom callable: lambda request, default_headers: {...} for custom processing
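As an illustration of the custom callable option, the following sketch starts from the browser's default headers and overlays only an allow-listed subset of the request's headers. It assumes the callable receives `(request, default_headers)` as in the signature above, that the request exposes a dict-like `headers` attribute, and that a plain dict is an acceptable return value; the function name and the allow-list are hypothetical.

```python
from types import SimpleNamespace

ALLOWED = {"accept-language", "referer"}  # hypothetical allow-list


def process_headers(request, default_headers):
    """Overlay allow-listed request headers on top of the browser defaults."""
    # Header names are case-insensitive, so compare and store them lowercased
    merged = {name.lower(): value for name, value in default_headers.items()}
    for name, value in request.headers.items():  # request.headers is assumed
        if name.lower() in ALLOWED:
            merged[name.lower()] = value
    return merged


# Stand-in request object so the sketch runs on its own
request = SimpleNamespace(headers={"Referer": "https://example.com/", "X-Debug": "1"})
defaults = {"User-Agent": "browser-default", "Accept-Language": "en-US"}
print(process_headers(request, defaults))
```

Under those assumptions, the function would be passed as the value of "CAMOUFOX_PROCESS_REQUEST_HEADERS" in place of one of the string options.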
Anti-Bot Evasion
Disclaimer
qCrawl is intended for ethical and research purposes only. Users should use this software at their own risk.
Camoufox provides browser-level anti-detection features. For framework-level anti-bot capabilities (rate limiting, proxies, user agent rotation, etc.), see the Anti-Bot Evasion Guide.
Camoufox-Specific Configuration
Camoufox provides browser-level anti-detection features:
1. Stealth Mode
Automatically patches WebDriver detection and fingerprinting:
custom_settings = {
    "DOWNLOAD_HANDLERS": {
        "camoufox": "qcrawl.downloaders.CamoufoxDownloader",
    },
    # Stealth mode is enabled by default - no configuration needed
    # Automatically patches:
    # - navigator.webdriver detection
    # - Chrome DevTools Protocol traces
    # - Automation-specific properties
}
2. Browser Fingerprint Randomization
Each context gets a unique, realistic browser fingerprint:
custom_settings = {
    "CAMOUFOX_CONTEXTS": {
        "default": {
            "viewport": {"width": 1920, "height": 1080},
            "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        },
        "mobile": {
            "viewport": {"width": 390, "height": 844},
            "user_agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X)",
        },
    },
}
3. Headless Detection Bypass
Configure headless mode with anti-detection:
custom_settings = {
    "CAMOUFOX_LAUNCH_OPTIONS": {
        "headless": True,  # Runs without GUI but appears as normal browser
        "args": [
            "--disable-blink-features=AutomationControlled",  # Remove automation flags
            "--disable-dev-shm-usage",  # Improve stability
            "--no-sandbox",  # Linux compatibility
        ],
    },
}
4. Canvas and WebGL Fingerprinting
Camoufox randomizes canvas and WebGL fingerprints automatically:
# No configuration needed - handled automatically
# Each context generates unique but realistic:
# - Canvas fingerprints
# - WebGL vendor/renderer strings
# - Audio context fingerprints
5. HTTPS Error Handling
Control certificate validation for testing environments:
custom_settings = {
    "CAMOUFOX_CONTEXTS": {
        "default": {
            "ignore_https_errors": True,  # For testing/development only
        }
    },
}
6. Browser Context Isolation
Isolate different scraping targets with separate contexts:
custom_settings = {
    "CAMOUFOX_CONTEXTS": {
        "site_a": {"viewport": {"width": 1920, "height": 1080}},
        "site_b": {"viewport": {"width": 1366, "height": 768}},
    },
}

# Use different contexts per domain
yield Request(
    url="https://site-a.com",
    meta={"use_handler": "camoufox", "camoufox_context": "site_a"}
)
yield Request(
    url="https://site-b.com",
    meta={"use_handler": "camoufox", "camoufox_context": "site_b"}
)
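When a spider covers many URLs, the per-domain choice can be centralized in a lookup table instead of being repeated on every request. The sketch below builds the request meta from the URL's hostname. The table `CONTEXT_BY_HOST` and the helper `camoufox_meta` are hypothetical, and leaving out "camoufox_context" for unknown hosts assumes that qCrawl then falls back to a default context.

```python
from urllib.parse import urlparse

# Hypothetical mapping from hostname to a context name in CAMOUFOX_CONTEXTS
CONTEXT_BY_HOST = {
    "site-a.com": "site_a",
    "site-b.com": "site_b",
}


def camoufox_meta(url: str) -> dict:
    """Build request meta that routes a URL to its site-specific context."""
    host = urlparse(url).hostname or ""
    meta = {"use_handler": "camoufox"}
    context = CONTEXT_BY_HOST.get(host)
    if context is not None:
        meta["camoufox_context"] = context
    return meta


print(camoufox_meta("https://site-a.com/products"))
# {'use_handler': 'camoufox', 'camoufox_context': 'site_a'}
```

The returned dict would be passed as `meta=` when constructing the request.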
Navigation Options
Customize page navigation behavior via camoufox_page_goto_kwargs:
yield Request(
    url="https://example.com",
    meta={
        "use_handler": "camoufox",
        "camoufox_page_goto_kwargs": {
            "wait_until": "networkidle",  # Wait for network to be idle
            "timeout": 60000,  # Custom timeout
            "referer": "https://google.com"  # Set referer
        }
    }
)
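A misspelled wait_until value is easy to miss until a crawl is running, so it can help to build the navigation kwargs through a small checked helper. This is a minimal sketch; `goto_kwargs` and `WAIT_UNTIL_CHOICES` are hypothetical names, and the accepted values are limited to the three documented on this page.

```python
WAIT_UNTIL_CHOICES = ("domcontentloaded", "load", "networkidle")


def goto_kwargs(wait_until="domcontentloaded", timeout_ms=None, referer=None):
    """Return a dict suitable for camoufox_page_goto_kwargs, with validation."""
    if wait_until not in WAIT_UNTIL_CHOICES:
        raise ValueError(
            f"wait_until must be one of {WAIT_UNTIL_CHOICES}, got {wait_until!r}"
        )
    kwargs = {"wait_until": wait_until}
    # Only include optional keys that were set explicitly
    if timeout_ms is not None:
        kwargs["timeout"] = timeout_ms
    if referer is not None:
        kwargs["referer"] = referer
    return kwargs


print(goto_kwargs("networkidle", timeout_ms=60000))
# {'wait_until': 'networkidle', 'timeout': 60000}
```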
wait_until options:
- "domcontentloaded": Wait for DOM to be ready (default)
- "load": Wait for load event
- "networkidle": Wait for network to be idle
Event Handlers
Register event handlers for browser events:
def handle_console(msg):
    print(f"Browser console: {msg.text}")


def handle_dialog(dialog):
    dialog.dismiss()


yield Request(
    url="https://example.com",
    meta={
        "use_handler": "camoufox",
        "camoufox_event_handlers": {
            "console": handle_console,
            "dialog": handle_dialog
        }
    }
)
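Printing is fine for quick checks, but a handler can also collect messages for later inspection. The sketch below uses a closure to bind a handler to a list. `make_console_collector` is a hypothetical helper, and the stand-in message object only mimics the `text` attribute used by the handler above.

```python
from types import SimpleNamespace


def make_console_collector(sink):
    """Return a console handler that appends each message's text to sink."""

    def handle_console(msg):
        sink.append(msg.text)

    return handle_console


messages = []
handler = make_console_collector(messages)

# Simulate the browser delivering one console message
handler(SimpleNamespace(text="hello from page"))
print(messages)  # ['hello from page']
```

The returned `handler` would take the place of `handle_console` in "camoufox_event_handlers".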
Request Abortion
Abort requests matching certain patterns to save bandwidth:
def should_abort(route_request):
    # Block images and stylesheets
    resource_type = route_request.resource_type
    return resource_type in ["image", "stylesheet", "font"]


custom_settings = {
    "CAMOUFOX_ABORT_REQUEST": should_abort
}
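The predicate can also take the request URL into account, for example to drop calls to third-party hosts. The sketch below is a variant of the function above, not a replacement for it: it assumes the object passed in exposes a `url` attribute next to `resource_type`, and the blocked host name is purely illustrative.

```python
from types import SimpleNamespace
from urllib.parse import urlparse

BLOCKED_TYPES = frozenset({"image", "stylesheet", "font"})
BLOCKED_HOSTS = frozenset({"ads.example.com"})  # hypothetical host


def should_abort(route_request):
    """Abort heavy resource types and any request to a blocked host."""
    if route_request.resource_type in BLOCKED_TYPES:
        return True
    host = urlparse(route_request.url).hostname or ""  # .url is assumed
    return host in BLOCKED_HOSTS


# Stand-in request objects so the sketch runs without a browser
image = SimpleNamespace(resource_type="image", url="https://example.com/a.png")
ad_script = SimpleNamespace(resource_type="script", url="https://ads.example.com/t.js")
page = SimpleNamespace(resource_type="document", url="https://example.com/")

print(should_abort(image), should_abort(ad_script), should_abort(page))
# True True False
```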
Performance Tips
- Use HTTP downloader when possible: Only use Camoufox for JavaScript-rendered pages or when anti-bot evasion is required
- Limit browser contexts: Keep CAMOUFOX_MAX_CONTEXTS low
- Abort unnecessary requests: Block images/fonts if not needed
- Reuse contexts: Use camoufox_context to reuse browser contexts
- Increase delays: Use higher DELAY_PER_DOMAIN for browser requests
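The first tip can be applied per request by choosing the handler from the URL. This is a minimal sketch: the path prefixes and the helper `handler_meta` are hypothetical, and returning an empty meta assumes that requests without "use_handler" go through the regular HTTP downloader registered for the URL scheme.

```python
from urllib.parse import urlparse

# Hypothetical path prefixes known to need JavaScript rendering
JS_RENDERED_PREFIXES = ("/js/", "/app/")


def handler_meta(url: str) -> dict:
    """Route only JavaScript-rendered paths through the browser handler."""
    path = urlparse(url).path
    if path.startswith(JS_RENDERED_PREFIXES):
        return {"use_handler": "camoufox"}
    return {}  # assumed to mean: use the default HTTP downloader


print(handler_meta("https://quotes.toscrape.com/js/"))      # {'use_handler': 'camoufox'}
print(handler_meta("https://quotes.toscrape.com/page/2/"))  # {}
```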
Debugging
Enable Browser Visibility
Run browser in non-headless mode to see what's happening:
"CAMOUFOX_LAUNCH_OPTIONS": {"headless": False}
Take Screenshots
Capture screenshots for debugging:
from qcrawl.core.page import PageMethod

"camoufox_page_methods": [
    PageMethod("screenshot", path="/tmp/debug.png", full_page=True)
]
Enable Logging
Set log level to see browser activity:
custom_settings = {
    "LOG_LEVEL": "DEBUG"
}