Anti-Bot Evasion
Disclaimer
qCrawl is intended for ethical and research purposes only. Use this software at your own risk.
Modern websites employ sophisticated anti-bot detection systems to prevent automated scraping. qCrawl provides a comprehensive set of framework-level features to help you avoid detection while respecting website policies and legal boundaries.
qCrawl's anti-bot capabilities work at the framework level and apply to all downloaders (HTTP, Camoufox, etc.).
These features help you:
- Mimic human behavior - Natural request patterns and timing
- Avoid fingerprinting - Rotate user agents, headers, and proxies
- Handle rate limiting - Respect server capacity with intelligent throttling
- Manage sessions - Automatic cookie and header handling
- Recover gracefully - Smart retry strategies with exponential backoff
qCrawl Anti-Bot Capabilities
1. Rate Limiting and Throttling
Control request rate to mimic human browsing patterns:
custom_settings = {
    # Global concurrency limit
    "CONCURRENCY": 3,

    # Per-domain concurrency (prevents overwhelming a single domain)
    "CONCURRENCY_PER_DOMAIN": 1,

    # Delay between requests to the same domain (seconds)
    "DELAY_PER_DOMAIN": 2.0,

    # Random delay range (adds randomness to delays)
    "RANDOMIZE_DOWNLOAD_DELAY": True,

    # Download delay (global, across all domains)
    "DOWNLOAD_DELAY": 1.0,
}
Key Settings:
- CONCURRENCY: Maximum concurrent requests across all domains
- CONCURRENCY_PER_DOMAIN: Maximum concurrent requests per domain (the most important setting for anti-bot)
- DELAY_PER_DOMAIN: Minimum delay between requests to the same domain
- RANDOMIZE_DOWNLOAD_DELAY: Adds random variance to delays (0.5x to 1.5x, illustrated below)
- DOWNLOAD_DELAY: Global minimum delay between any two requests
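For intuition, the effective delay with randomization enabled can be modeled as follows. This is a minimal sketch assuming the 0.5x-1.5x variance described above; it is not qCrawl's internal code:

import random

def effective_delay(base_delay: float, randomize: bool = True) -> float:
    """Model of the delay applied between requests to one domain."""
    if randomize:
        # Assumed behavior: scale the base delay by a factor in [0.5, 1.5]
        return base_delay * random.uniform(0.5, 1.5)
    return base_delay

# With DELAY_PER_DOMAIN = 2.0, actual delays land between 1.0s and 3.0s
print(effective_delay(2.0))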
Example Spider:
from qcrawl.core.spider import Spider

class PoliteSpider(Spider):
    name = "polite_spider"

    custom_settings = {
        "CONCURRENCY_PER_DOMAIN": 1,       # One request at a time per domain
        "DELAY_PER_DOMAIN": 3.0,           # 3 seconds between requests
        "RANDOMIZE_DOWNLOAD_DELAY": True,  # Vary timing naturally
    }
2. User Agent Rotation
Rotate user agents to avoid fingerprinting:
import random

from qcrawl.core.spider import Spider
from qcrawl.core.request import Request

class RotatingUserAgentSpider(Spider):
    name = "rotating_ua_spider"

    custom_settings = {
        # Fallback user agent for requests without an explicit header
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    }

    # Pool of realistic user agents
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 14.2; rv:121.0) Gecko/20100101 Firefox/121.0",
    ]

    async def start_requests(self):
        for url in self.start_urls:
            yield Request(
                url=url,
                headers={"User-Agent": random.choice(self.user_agents)},
            )
Best Practices:
- Use recent, realistic user agent strings
- Match the user agent to other headers (Accept, Accept-Language, etc.); see the sketch after this list
- Consider matching viewport size for browser automation
- Don't rotate too frequently (appears suspicious)
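To keep user agents and companion headers consistent, rotate them together as profiles. A minimal sketch; the profile contents are illustrative assumptions, not qCrawl defaults:

import random

# Each user agent travels with headers a browser of that family would send
HEADER_PROFILES = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9",
        "Sec-Ch-Ua": '"Not_A Brand";v="8", "Chromium";v="120", "Google Chrome";v="120"',
    },
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
        "Accept-Language": "en-US,en;q=0.5",
        # Firefox sends no Sec-Ch-Ua client hints, so none are included
    },
]

headers = random.choice(HEADER_PROFILES)  # pass as Request(headers=headers)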
3. Cookie Management
Cookies are automatically managed by qCrawl's CookiesMiddleware:
custom_settings = {
    # Enable cookie handling (enabled by default)
    "DOWNLOADER_MIDDLEWARES": {
        "qcrawl.middleware.downloader.cookies.CookiesMiddleware": 700,
    },
    # Cookies persist across requests to the same domain automatically;
    # no manual cookie handling is needed
}
Features:
- Automatic cookie storage per domain
- Sends cookies with subsequent requests
- Handles Set-Cookie headers
- Respects cookie expiration
- Thread-safe cookie jar
Manual Cookie Control (if needed):
from qcrawl.core.request import Request

# Set cookies for a specific request
yield Request(
    url="https://example.com",
    headers={"Cookie": "session_id=abc123; user_pref=dark_mode"},
)

# Access cookies in the parse method
async def parse(self, response):
    cookies = response.headers.get("Set-Cookie")
    # Process cookies if needed
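To inspect individual cookies from a Set-Cookie header, the standard-library parser is enough. Plain Python, independent of qCrawl:

from http.cookies import SimpleCookie

jar = SimpleCookie()
jar.load("session_id=abc123; Path=/; HttpOnly")
print(jar["session_id"].value)  # abc123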
4. Proxy Support
Rotate proxies to distribute requests across different IP addresses:
custom_settings = {
    "DOWNLOADER_MIDDLEWARES": {
        "qcrawl.middleware.downloader.httpproxy.HttpProxyMiddleware": 750,
    },
}

# Per-request proxy
yield Request(
    url="https://example.com",
    meta={"proxy": "http://proxy.example.com:8080"},
)
Proxy Rotation Example:
import random

from qcrawl.core.spider import Spider
from qcrawl.core.request import Request

class ProxyRotatingSpider(Spider):
    name = "proxy_spider"

    # Pool of proxy servers
    proxies = [
        "http://proxy1.example.com:8080",
        "http://proxy2.example.com:8080",
        "http://proxy3.example.com:8080",
    ]

    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            "qcrawl.middleware.downloader.httpproxy.HttpProxyMiddleware": 750,
        },
    }

    async def start_requests(self):
        for url in self.start_urls:
            yield Request(
                url=url,
                meta={"proxy": random.choice(self.proxies)},
            )
Proxy Authentication:
# HTTP Basic Auth: embed credentials in the proxy URL
yield Request(
    url="https://example.com",
    meta={"proxy": "http://username:password@proxy.example.com:8080"},
)
5. Retry Strategy with Backoff
Intelligent retry behavior mimics human patterns and handles transient errors:
custom_settings = {
    "DOWNLOADER_MIDDLEWARES": {
        "qcrawl.middleware.downloader.retry.RetryMiddleware": 550,
    },

    # Retry configuration
    "RETRY_TIMES": 3,  # Maximum retry attempts

    # HTTP status codes that trigger a retry
    "RETRY_HTTP_CODES": [500, 502, 503, 504, 408, 429],

    # Priority adjustment for retried requests (negative = lower priority)
    "RETRY_PRIORITY_ADJUST": -1,
}
How It Works:
- The request fails with a retryable status code (e.g., 503 Service Unavailable)
- The request is re-queued with lower priority
- A natural delay occurs before the retry (due to queue processing)
- The process repeats up to RETRY_TIMES attempts
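If you want explicit exponential backoff (mentioned above) on top of the queue-driven delay, the per-attempt wait is usually computed like this. A plain-Python sketch; the jitter term is a common convention, not a qCrawl setting:

import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter: ~1s, 2s, 4s, 8s... capped at 60s."""
    delay = min(cap, base * (2 ** attempt))
    # Jitter spreads retries out so they don't arrive in synchronized bursts
    return random.uniform(0, delay)

for attempt in range(4):
    print(f"attempt {attempt}: wait up to {backoff_delay(attempt):.1f}s")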
Custom Retry Logic:
from qcrawl.core.request import Request

# Disable retry for a specific request
yield Request(
    url="https://example.com",
    meta={"dont_retry": True},
)

# Custom retry count for a specific request
yield Request(
    url="https://example.com",
    meta={"max_retry_times": 5},
)
6. Custom Request Headers
Customize headers per request or globally to appear more legitimate:
custom_settings = {
    # Default headers for all requests
    "DEFAULT_REQUEST_HEADERS": {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "DNT": "1",  # Do Not Track
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
    },
}
Per-Request Headers:
from qcrawl.core.request import Request

# Add a Referer to appear as if coming from another page
yield Request(
    url="https://example.com/page2",
    headers={
        "Referer": "https://example.com/page1",
        "X-Requested-With": "XMLHttpRequest",  # AJAX-style request
    },
)
Common Header Patterns:
# Standard browser headers
BROWSER_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
}

# Mobile browser headers
MOBILE_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) AppleWebKit/605.1.15",
}

# API client headers
API_HEADERS = {
    "Accept": "application/json",
    "Content-Type": "application/json",
    "X-Requested-With": "XMLHttpRequest",
}
7. Realistic Navigation Patterns
Use PageMethod with Camoufox to simulate human-like interactions:
import random

from qcrawl.core.request import Request
from qcrawl.core.page import PageMethod

yield Request(
    url="https://example.com",
    meta={
        "use_handler": "camoufox",
        "camoufox_page_methods": [
            # Wait for the page to load
            PageMethod("wait_for_selector", "body"),
            # Simulate reading time (2-5 seconds)
            PageMethod("wait_for_timeout", random.randint(2000, 5000)),
            # Scroll like a human (partial scroll first)
            PageMethod("evaluate", "window.scrollTo(0, document.body.scrollHeight / 2)"),
            PageMethod("wait_for_timeout", random.randint(500, 1500)),
            # Scroll to the bottom
            PageMethod("evaluate", "window.scrollTo(0, document.body.scrollHeight)"),
            PageMethod("wait_for_timeout", random.randint(500, 1000)),
            # Random mouse-position probe (if needed)
            PageMethod("evaluate", f"document.elementFromPoint({random.randint(100, 500)}, {random.randint(100, 500)})"),
        ],
    },
)
See Browser Automation for more details on PageMethod.
Combining with Camoufox
For maximum stealth, combine qCrawl's framework-level features with Camoufox's browser-level anti-detection:
import random

from qcrawl.core.spider import Spider
from qcrawl.core.request import Request
from qcrawl.core.page import PageMethod

class StealthSpider(Spider):
    """Maximum-stealth spider combining all techniques."""

    name = "stealth_spider"
    start_urls = ["https://example.com"]

    # User agent pool
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    ]

    # Proxy pool (optional)
    proxies = [
        "http://proxy1.example.com:8080",
        "http://proxy2.example.com:8080",
    ]
    custom_settings = {
        # Camoufox configuration
        "DOWNLOAD_HANDLERS": {
            "http": "qcrawl.downloaders.HTTPDownloader",
            "camoufox": "qcrawl.downloaders.CamoufoxDownloader",
        },
        "CAMOUFOX_CONTEXTS": {
            "default": {
                "viewport": {"width": 1920, "height": 1080},
            }
        },
        "CAMOUFOX_LAUNCH_OPTIONS": {
            # Camoufox is Firefox-based and handles automation masking itself,
            # so Chromium flags like --disable-blink-features do not apply
            "headless": True,
        },

        # Rate limiting (qCrawl framework-level)
        "CONCURRENCY": 2,
        "CONCURRENCY_PER_DOMAIN": 1,
        "DELAY_PER_DOMAIN": 3.0,
        "RANDOMIZE_DOWNLOAD_DELAY": True,

        # Headers (qCrawl framework-level)
        "DEFAULT_REQUEST_HEADERS": {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
            "Accept-Encoding": "gzip, deflate, br",
            "DNT": "1",
        },

        # Retry with backoff (qCrawl framework-level)
        "RETRY_TIMES": 3,
        "RETRY_HTTP_CODES": [500, 502, 503, 504, 408, 429],

        # Middlewares (qCrawl framework-level)
        "DOWNLOADER_MIDDLEWARES": {
            "qcrawl.middleware.downloader.cookies.CookiesMiddleware": 700,
            "qcrawl.middleware.downloader.retry.RetryMiddleware": 550,
            "qcrawl.middleware.downloader.httpproxy.HttpProxyMiddleware": 750,
        },
    }
    async def start_requests(self):
        for url in self.start_urls:
            # Rotate user agent and proxy
            user_agent = random.choice(self.user_agents)
            proxy = random.choice(self.proxies) if self.proxies else None

            meta = {
                "use_handler": "camoufox",
                "camoufox_page_methods": [
                    # Wait for content
                    PageMethod("wait_for_selector", "body"),
                    # Simulate human reading time
                    PageMethod("wait_for_timeout", random.randint(2000, 4000)),
                    # Random scroll pattern
                    PageMethod("evaluate", f"window.scrollTo(0, {random.randint(200, 800)})"),
                    PageMethod("wait_for_timeout", random.randint(500, 1500)),
                ],
            }
            if proxy:
                meta["proxy"] = proxy

            yield Request(
                url=url,
                headers={"User-Agent": user_agent},
                meta=meta,
            )
    async def parse(self, response):
        # Extract data
        rv = self.response_view(response)
        # ... your parsing logic

        # Follow links with the same stealth techniques
        for link in rv.doc.cssselect("a"):
            href = link.get("href")
            if href:
                yield rv.follow(
                    href,
                    headers={"User-Agent": random.choice(self.user_agents)},
                    meta={
                        "use_handler": "camoufox",
                        "camoufox_page_methods": [
                            PageMethod("wait_for_timeout", random.randint(1000, 3000)),
                        ],
                    },
                )
Best Practices
1. Start Slow, Scale Gradually
# Start with conservative settings
initial_settings = {
    "CONCURRENCY": 1,
    "DELAY_PER_DOMAIN": 5.0,
}

# Gradually increase if no issues arise
optimized_settings = {
    "CONCURRENCY": 3,
    "DELAY_PER_DOMAIN": 2.0,
}
2. Monitor Response Codes
Watch for signs of detection:
from qcrawl.core.spider import Spider

class MonitoringSpider(Spider):
    name = "monitoring_spider"

    async def parse(self, response):
        # Log suspicious responses
        if response.status_code == 429:  # Too Many Requests
            self.logger.warning(f"Rate limited at {response.url}")
        elif response.status_code == 403:  # Forbidden
            self.logger.warning(f"Access denied at {response.url}")

        # Adjust behavior based on the response. Note: whether a mid-crawl
        # change to custom_settings is picked up depends on the scheduler.
        if response.status_code in (429, 503):
            self.custom_settings["DELAY_PER_DOMAIN"] *= 1.5  # slow down
3. Respect robots.txt
custom_settings = {
    "DOWNLOADER_MIDDLEWARES": {
        "qcrawl.middleware.downloader.robotstxt.RobotsTxtMiddleware": 100,
    },
}
4. Use Appropriate Delays
Rule of thumb:
- Light scraping (a few pages): 1-2 second delay
- Medium scraping (hundreds of pages): 2-5 second delay
- Heavy scraping (thousands of pages): 5-10 second delay
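Codified as a small helper (the thresholds are the rough figures above, not qCrawl settings):

def recommended_delay(page_count: int) -> float:
    """Rough per-domain delay in seconds for a crawl of the given size."""
    if page_count < 100:
        return 2.0   # light scraping: 1-2s
    if page_count < 1000:
        return 5.0   # medium scraping: 2-5s
    return 10.0      # heavy scraping: 5-10s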
5. Rotate Everything
Don't just rotate one parameter - rotate multiple aspects:
# Good: rotate multiple parameters at once
yield Request(
    url=url,
    headers={
        "User-Agent": random.choice(user_agents),
        "Accept-Language": random.choice(["en-US,en;q=0.9", "en-GB,en;q=0.9"]),
    },
    meta={
        "proxy": random.choice(proxies),
        "use_handler": random.choice(["http", "camoufox"]),
    },
)
6. Time Your Requests
Scrape during off-peak hours when possible:
import datetime

def should_scrape_now() -> bool:
    """Scrape during off-peak hours (e.g., 2-6 AM in the target timezone)."""
    # Note: now() reads the local clock; if the target site's timezone
    # differs from yours, convert with zoneinfo first.
    hour = datetime.datetime.now().hour
    return 2 <= hour <= 6

if should_scrape_now():
    # Run spider
    pass
7. Handle CAPTCHAs Gracefully
async def parse(self, response):
    if "captcha" in response.text.lower():
        self.logger.warning(f"CAPTCHA detected at {response.url}")
        # Option 1: Stop scraping
        return
        # Option 2: Use a CAPTCHA solving service (if authorized)
        # captcha_solution = await solve_captcha(response)
        # Option 3: Switch to a different proxy/user agent
        # yield retry_with_different_identity(response.request)
8. Implement Backoff on Detection
from qcrawl.core.spider import Spider

class AdaptiveSpider(Spider):
    name = "adaptive_spider"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.detection_count = 0
        self.base_delay = 2.0

    async def parse(self, response):
        if self.is_detected(response):
            self.detection_count += 1
            new_delay = self.base_delay * (2 ** self.detection_count)
            self.logger.warning(f"Detection! Increasing delay to {new_delay}s")
            self.custom_settings["DELAY_PER_DOMAIN"] = new_delay
        elif self.detection_count > 0:
            # Gradually relax the delay after successful responses
            self.detection_count -= 1

    def is_detected(self, response):
        return response.status_code in (429, 403) or "captcha" in response.text.lower()
Common Anti-Bot Patterns and Solutions
Pattern 1: Rate Limiting (429 Response)
Detection: Server returns 429 Too Many Requests
Solution:
custom_settings = {
    "DELAY_PER_DOMAIN": 5.0,
    "CONCURRENCY_PER_DOMAIN": 1,
    "RETRY_HTTP_CODES": [429],
    "RETRY_TIMES": 5,
}
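Many servers also send a Retry-After header with 429 responses. Honoring it yourself looks roughly like the sketch below; whether RetryMiddleware consults this header is not specified here, so treat this as an assumption:

import asyncio

async def parse(self, response):
    if response.status_code == 429:
        # Retry-After is either seconds or an HTTP date; handle the common case
        retry_after = response.headers.get("Retry-After", "")
        if retry_after.isdigit():
            await asyncio.sleep(int(retry_after))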
Pattern 2: IP-Based Blocking
Detection: 403 Forbidden or connection refused
Solution:
# Use proxy rotation
custom_settings = {
    "DOWNLOADER_MIDDLEWARES": {
        "qcrawl.middleware.downloader.httpproxy.HttpProxyMiddleware": 750,
    },
}

# Rotate proxies per request
yield Request(url=url, meta={"proxy": random.choice(proxy_pool)})
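A round-robin alternative to random selection, which spreads load evenly across the pool (standard-library itertools):

from itertools import cycle

proxy_pool = cycle([
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
])

# Each request takes the next proxy in order
yield Request(url=url, meta={"proxy": next(proxy_pool)})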
Pattern 3: JavaScript Challenge
Detection: Page requires JavaScript execution
Solution:
# Use Camoufox for JavaScript rendering
yield Request(
    url=url,
    meta={
        "use_handler": "camoufox",
        "camoufox_page_methods": [
            PageMethod("wait_for_selector", ".content"),
        ],
    },
)
Pattern 4: User Agent Filtering
Detection: Different content for different user agents
Solution:
# Use realistic, complete user agent strings
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]

yield Request(url=url, headers={"User-Agent": random.choice(user_agents)})
Pattern 5: Session/Cookie Requirements
Detection: Access denied without valid session
Solution:
# Let CookiesMiddleware handle sessions automatically
custom_settings = {
    "DOWNLOADER_MIDDLEWARES": {
        "qcrawl.middleware.downloader.cookies.CookiesMiddleware": 700,
    },
}

# Or handle login manually
async def start_requests(self):
    # Log in first
    yield Request(
        url="https://example.com/login",
        method="POST",
        body={"username": "...", "password": "..."},
        callback=self.after_login,
    )

async def after_login(self, response):
    # Cookies are stored automatically; proceed with scraping
    yield Request(url="https://example.com/data")
Monitoring and Debugging
Enable Detailed Logging
custom_settings = {
    "LOG_LEVEL": "DEBUG",
}

# In the spider
self.logger.info(f"Successfully scraped {response.url}")
self.logger.warning(f"Retrying {response.url} - attempt {retry_count}")
Track Success Rates
from qcrawl.core.spider import Spider

class MonitoredSpider(Spider):
    name = "monitored_spider"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.success_count = 0
        self.failure_count = 0

    async def parse(self, response):
        if response.status_code == 200:
            self.success_count += 1
        else:
            self.failure_count += 1

        # Log metrics
        total = self.success_count + self.failure_count
        success_rate = self.success_count / total if total > 0 else 0
        self.logger.info(f"Success rate: {success_rate:.2%}")
Summary
qCrawl provides comprehensive anti-bot evasion capabilities:
| Feature | Purpose | Key Settings |
|---|---|---|
| Rate Limiting | Mimic human speed | DELAY_PER_DOMAIN, CONCURRENCY_PER_DOMAIN |
| User Agent Rotation | Avoid fingerprinting | Per-request headers |
| Cookie Management | Session handling | CookiesMiddleware (automatic) |
| Proxy Support | IP rotation | HttpProxyMiddleware, per-request meta |
| Retry Strategy | Handle transient errors | RETRY_TIMES, RETRY_HTTP_CODES |
| Custom Headers | Appear legitimate | DEFAULT_REQUEST_HEADERS, per-request |
| Navigation Patterns | Human-like behavior | PageMethod with Camoufox |