Anti-Bot Evasion
Disclaimer
qCrawl is intended for ethical and research purposes only. Use this software at your own risk.
Modern websites employ sophisticated anti-bot detection systems to prevent automated scraping. qCrawl provides a comprehensive set of framework-level features to help you avoid detection while respecting website policies and legal boundaries.
qCrawl's anti-bot capabilities work at the framework level and apply to all downloaders (HTTP, Camoufox, etc.).
These features help you:
- Mimic human behavior - Natural request patterns and timing
- Avoid fingerprinting - Rotate user agents, headers, and proxies
- Handle rate limiting - Respect server capacity with intelligent throttling
- Manage sessions - Automatic cookie and header handling
- Recover gracefully - Smart retry strategies with exponential backoff
qCrawl Anti-Bot Capabilities
1. Rate Limiting and Throttling
Control request rate to mimic human browsing patterns:
custom_settings = {
    # Global concurrency limit
    "CONCURRENCY": 3,

    # Per-domain concurrency (prevents overwhelming a single domain)
    "CONCURRENCY_PER_DOMAIN": 1,

    # Delay between requests to the same domain (seconds)
    "DELAY_PER_DOMAIN": 2.0,

    # Random delay range (adds randomness to delays)
    "RANDOMIZE_DOWNLOAD_DELAY": True,

    # Download delay (global, across all domains)
    "DOWNLOAD_DELAY": 1.0,
}
Key Settings:
- CONCURRENCY: Maximum concurrent requests across all domains
- CONCURRENCY_PER_DOMAIN: Maximum concurrent requests per domain (the most important setting for anti-bot)
- DELAY_PER_DOMAIN: Minimum delay between requests to the same domain
- RANDOMIZE_DOWNLOAD_DELAY: Adds random variance to delays (0.5x to 1.5x, illustrated below)
- DOWNLOAD_DELAY: Global minimum delay between any two requests
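For intuition, the effective delay with randomization enabled can be modeled as follows. This is a minimal sketch assuming the 0.5x-1.5x variance described above; it is not qCrawl's internal code:

import random

def effective_delay(base_delay: float, randomize: bool = True) -> float:
    """Model of the delay applied between requests to one domain."""
    if randomize:
        # Assumed behavior: scale the base delay by a factor in [0.5, 1.5]
        return base_delay * random.uniform(0.5, 1.5)
    return base_delay

# With DELAY_PER_DOMAIN = 2.0, actual delays land between 1.0s and 3.0s
print(effective_delay(2.0))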
Example Spider:
from qcrawl.core.spider import Spider

class PoliteSpider(Spider):
    name = "polite_spider"

    custom_settings = {
        "CONCURRENCY_PER_DOMAIN": 1,       # One request at a time per domain
        "DELAY_PER_DOMAIN": 3.0,           # 3 seconds between requests
        "RANDOMIZE_DOWNLOAD_DELAY": True,  # Vary timing naturally
    }
2. User Agent Rotation
Rotate user agents to avoid fingerprinting:
import random

from qcrawl.core.spider import Spider
from qcrawl.core.request import Request

class RotatingUserAgentSpider(Spider):
    name = "rotating_ua_spider"

    custom_settings = {
        # Fallback user agent for requests without an explicit header
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    }

    # Pool of realistic user agents
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 14.2; rv:121.0) Gecko/20100101 Firefox/121.0",
    ]

    async def start_requests(self):
        for url in self.start_urls:
            yield Request(
                url=url,
                headers={"User-Agent": random.choice(self.user_agents)},
            )
Best Practices:
- Use recent, realistic user agent strings
- Match the user agent to other headers (Accept, Accept-Language, etc.); see the sketch after this list
- Consider matching viewport size for browser automation
- Don't rotate too frequently (appears suspicious)
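To keep user agents and companion headers consistent, rotate them together as profiles. A minimal sketch; the profile contents are illustrative assumptions, not qCrawl defaults:

import random

# Each user agent travels with headers a browser of that family would send
HEADER_PROFILES = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9",
        "Sec-Ch-Ua": '"Not_A Brand";v="8", "Chromium";v="120", "Google Chrome";v="120"',
    },
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
        "Accept-Language": "en-US,en;q=0.5",
        # Firefox sends no Sec-Ch-Ua client hints, so none are included
    },
]

headers = random.choice(HEADER_PROFILES)  # pass as Request(headers=headers)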
3. Cookie Management
Cookies are automatically managed by qCrawl's CookiesMiddleware:
custom_settings = {
    # Enable cookie handling (enabled by default)
    "DOWNLOADER_MIDDLEWARES": {
        "qcrawl.middleware.downloader.cookies.CookiesMiddleware": 700,
    },
    # Cookies persist across requests to the same domain automatically;
    # no manual cookie handling is needed
}
Features:
- Automatic cookie storage per domain
- Sends cookies with subsequent requests
- Handles Set-Cookie headers
- Respects cookie expiration
- Thread-safe cookie jar
Manual Cookie Control (if needed):
from qcrawl.core.request import Request

# Set cookies for a specific request
yield Request(
    url="https://example.com",
    headers={"Cookie": "session_id=abc123; user_pref=dark_mode"},
)

# Access cookies in the parse method
async def parse(self, response):
    cookies = response.headers.get("Set-Cookie")
    # Process cookies if needed
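To inspect individual cookies from a Set-Cookie header, the standard-library parser is enough. Plain Python, independent of qCrawl:

from http.cookies import SimpleCookie

jar = SimpleCookie()
jar.load("session_id=abc123; Path=/; HttpOnly")
print(jar["session_id"].value)  # abc123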
4. Proxy Support
Rotate proxies to distribute requests across different IP addresses:
custom_settings = {
    "DOWNLOADER_MIDDLEWARES": {
        "qcrawl.middleware.downloader.httpproxy.HttpProxyMiddleware": 750,
    },
}

# Per-request proxy
yield Request(
    url="https://example.com",
    meta={"proxy": "http://proxy.example.com:8080"},
)
Proxy Rotation Example:
import random

from qcrawl.core.spider import Spider
from qcrawl.core.request import Request

class ProxyRotatingSpider(Spider):
    name = "proxy_spider"

    # Pool of proxy servers
    proxies = [
        "http://proxy1.example.com:8080",
        "http://proxy2.example.com:8080",
        "http://proxy3.example.com:8080",
    ]

    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            "qcrawl.middleware.downloader.httpproxy.HttpProxyMiddleware": 750,
        },
    }

    async def start_requests(self):
        for url in self.start_urls:
            yield Request(
                url=url,
                meta={"proxy": random.choice(self.proxies)},
            )
Proxy Authentication:
# HTTP Basic Auth: embed credentials in the proxy URL
yield Request(
    url="https://example.com",
    meta={"proxy": "http://username:password@proxy.example.com:8080"},
)
5. Retry Strategy with Backoff
Intelligent retry behavior mimics human patterns and handles transient errors:
custom_settings = {
    "DOWNLOADER_MIDDLEWARES": {
        "qcrawl.middleware.downloader.retry.RetryMiddleware": 550,
    },

    # Retry configuration
    "RETRY_TIMES": 3,  # Maximum retry attempts

    # HTTP status codes that trigger a retry
    "RETRY_HTTP_CODES": [500, 502, 503, 504, 408, 429],

    # Priority adjustment for retried requests (negative = lower priority)
    "RETRY_PRIORITY_ADJUST": -1,
}
How It Works:
- The request fails with a retryable status code (e.g., 503 Service Unavailable)
- The request is re-queued with lower priority
- A natural delay occurs before the retry (due to queue processing)
- The process repeats up to RETRY_TIMES attempts
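If you want explicit exponential backoff (mentioned above) on top of the queue-driven delay, the per-attempt wait is usually computed like this. A plain-Python sketch; the jitter term is a common convention, not a qCrawl setting:

import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter: ~1s, 2s, 4s, 8s... capped at 60s."""
    delay = min(cap, base * (2 ** attempt))
    # Jitter spreads retries out so they don't arrive in synchronized bursts
    return random.uniform(0, delay)

for attempt in range(4):
    print(f"attempt {attempt}: wait up to {backoff_delay(attempt):.1f}s")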
Custom Retry Logic:
from qcrawl.core.request import Request

# Disable retry for a specific request
yield Request(
    url="https://example.com",
    meta={"dont_retry": True},
)

# Custom retry count for a specific request
yield Request(
    url="https://example.com",
    meta={"max_retry_times": 5},
)
6. Custom Request Headers
Customize headers per request or globally to appear more legitimate:
custom_settings = {
    # Default headers for all requests
    "DEFAULT_REQUEST_HEADERS": {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "DNT": "1",  # Do Not Track
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
    },
}
Per-Request Headers:
from qcrawl.core.request import Request

# Add a Referer to appear as if coming from another page
yield Request(
    url="https://example.com/page2",
    headers={
        "Referer": "https://example.com/page1",
        "X-Requested-With": "XMLHttpRequest",  # AJAX-style request
    },
)
Common Header Patterns:
# Standard browser headers
BROWSER_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
}

# Mobile browser headers
MOBILE_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) AppleWebKit/605.1.15",
}

# API client headers
API_HEADERS = {
    "Accept": "application/json",
    "Content-Type": "application/json",
    "X-Requested-With": "XMLHttpRequest",
}
7. Realistic Navigation Patterns
Use PageMethod with Camoufox to simulate human-like interactions:
import random

from qcrawl.core.request import Request
from qcrawl.core.page import PageMethod

yield Request(
    url="https://example.com",
    meta={
        "use_handler": "camoufox",
        "camoufox_page_methods": [
            # Wait for the page to load
            PageMethod("wait_for_selector", "body"),
            # Simulate reading time (2-5 seconds)
            PageMethod("wait_for_timeout", random.randint(2000, 5000)),
            # Scroll like a human (partial scroll first)
            PageMethod("evaluate", "window.scrollTo(0, document.body.scrollHeight / 2)"),
            PageMethod("wait_for_timeout", random.randint(500, 1500)),
            # Scroll to the bottom
            PageMethod("evaluate", "window.scrollTo(0, document.body.scrollHeight)"),
            PageMethod("wait_for_timeout", random.randint(500, 1000)),
            # Random mouse-position probe (if needed)
            PageMethod("evaluate", f"document.elementFromPoint({random.randint(100, 500)}, {random.randint(100, 500)})"),
        ],
    },
)
See Browser Automation for more details on PageMethod.
Combining with Camoufox
For maximum stealth, combine qCrawl's framework-level features with Camoufox's browser-level anti-detection:
import random

from qcrawl.core.spider import Spider
from qcrawl.core.request import Request
from qcrawl.core.page import PageMethod

class StealthSpider(Spider):
    """Maximum-stealth spider combining all techniques."""

    name = "stealth_spider"
    start_urls = ["https://example.com"]

    # User agent pool
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    ]

    # Proxy pool (optional)
    proxies = [
        "http://proxy1.example.com:8080",
        "http://proxy2.example.com:8080",
    ]
    custom_settings = {
        # Camoufox configuration
        "DOWNLOAD_HANDLERS": {
            "http": "qcrawl.downloaders.HTTPDownloader",
            "camoufox": "qcrawl.downloaders.CamoufoxDownloader",
        },
        "CAMOUFOX_CONTEXTS": {
            "default": {
                "viewport": {"width": 1920, "height": 1080},
            }
        },
        "CAMOUFOX_LAUNCH_OPTIONS": {
            # Camoufox is Firefox-based and handles automation masking itself,
            # so Chromium flags like --disable-blink-features do not apply
            "headless": True,
        },

        # Rate limiting (qCrawl framework-level)
        "CONCURRENCY": 2,
        "CONCURRENCY_PER_DOMAIN": 1,
        "DELAY_PER_DOMAIN": 3.0,
        "RANDOMIZE_DOWNLOAD_DELAY": True,

        # Headers (qCrawl framework-level)
        "DEFAULT_REQUEST_HEADERS": {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
            "Accept-Encoding": "gzip, deflate, br",
            "DNT": "1",
        },

        # Retry with backoff (qCrawl framework-level)
        "RETRY_TIMES": 3,
        "RETRY_HTTP_CODES": [500, 502, 503, 504, 408, 429],

        # Middlewares (qCrawl framework-level)
        "DOWNLOADER_MIDDLEWARES": {
            "qcrawl.middleware.downloader.cookies.CookiesMiddleware": 700,
            "qcrawl.middleware.downloader.retry.RetryMiddleware": 550,
            "qcrawl.middleware.downloader.httpproxy.HttpProxyMiddleware": 750,
        },
    }
    async def start_requests(self):
        for url in self.start_urls:
            # Rotate user agent and proxy
            user_agent = random.choice(self.user_agents)
            proxy = random.choice(self.proxies) if self.proxies else None

            meta = {
                "use_handler": "camoufox",
                "camoufox_page_methods": [
                    # Wait for content
                    PageMethod("wait_for_selector", "body"),
                    # Simulate human reading time
                    PageMethod("wait_for_timeout", random.randint(2000, 4000)),
                    # Random scroll pattern
                    PageMethod("evaluate", f"window.scrollTo(0, {random.randint(200, 800)})"),
                    PageMethod("wait_for_timeout", random.randint(500, 1500)),
                ],
            }
            if proxy:
                meta["proxy"] = proxy

            yield Request(
                url=url,
                headers={"User-Agent": user_agent},
                meta=meta,
            )
    async def parse(self, response):
        # Extract data
        rv = self.response_view(response)
        # ... your parsing logic

        # Follow links with the same stealth techniques
        for link in rv.doc.cssselect("a"):
            href = link.get("href")
            if href:
                yield rv.follow(
                    href,
                    headers={"User-Agent": random.choice(self.user_agents)},
                    meta={
                        "use_handler": "camoufox",
                        "camoufox_page_methods": [
                            PageMethod("wait_for_timeout", random.randint(1000, 3000)),
                        ],
                    },
                )
Best Practices
1. Start Slow, Scale Gradually
# Start with conservative settings
initial_settings = {
    "CONCURRENCY": 1,
    "DELAY_PER_DOMAIN": 5.0,
}

# Gradually increase if no issues arise
optimized_settings = {
    "CONCURRENCY": 3,
    "DELAY_PER_DOMAIN": 2.0,
}
2. Monitor Response Codes
Watch for signs of detection:
from qcrawl.core.spider import Spider

class MonitoringSpider(Spider):
    name = "monitoring_spider"

    async def parse(self, response):
        # Log suspicious responses
        if response.status_code == 429:  # Too Many Requests
            self.logger.warning(f"Rate limited at {response.url}")
        elif response.status_code == 403:  # Forbidden
            self.logger.warning(f"Access denied at {response.url}")

        # Adjust behavior based on the response. Note: whether a mid-crawl
        # change to custom_settings is picked up depends on the scheduler.
        if response.status_code in (429, 503):
            self.custom_settings["DELAY_PER_DOMAIN"] *= 1.5  # slow down
3. Respect robots.txt
custom_settings = {
    "DOWNLOADER_MIDDLEWARES": {
        "qcrawl.middleware.downloader.robotstxt.RobotsTxtMiddleware": 100,
    },
}
4. Use Appropriate Delays
Rule of thumb:
- Light scraping (a few pages): 1-2 second delay
- Medium scraping (hundreds of pages): 2-5 second delay
- Heavy scraping (thousands of pages): 5-10 second delay
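Codified as a small helper (the thresholds are the rough figures above, not qCrawl settings):

def recommended_delay(page_count: int) -> float:
    """Rough per-domain delay in seconds for a crawl of the given size."""
    if page_count < 100:
        return 2.0   # light scraping: 1-2s
    if page_count < 1000:
        return 5.0   # medium scraping: 2-5s
    return 10.0      # heavy scraping: 5-10s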
5. Rotate Everything
Don't just rotate one parameter - rotate multiple aspects:
# Good: rotate multiple parameters at once
yield Request(
    url=url,
    headers={
        "User-Agent": random.choice(user_agents),
        "Accept-Language": random.choice(["en-US,en;q=0.9", "en-GB,en;q=0.9"]),
    },
    meta={
        "proxy": random.choice(proxies),
        "use_handler": random.choice(["http", "camoufox"]),
    },
)
6. Time Your Requests
Scrape during off-peak hours when possible:
import datetime

def should_scrape_now() -> bool:
    """Scrape during off-peak hours (e.g., 2-6 AM in the target timezone)."""
    # Note: now() reads the local clock; if the target site's timezone
    # differs from yours, convert with zoneinfo first.
    hour = datetime.datetime.now().hour
    return 2 <= hour <= 6

if should_scrape_now():
    # Run spider
    pass
7. Handle CAPTCHAs Gracefully
async def parse(self, response):
    if "captcha" in response.text.lower():
        self.logger.warning(f"CAPTCHA detected at {response.url}")
        # Option 1: Stop scraping
        return
        # Option 2: Use a CAPTCHA solving service (if authorized)
        # captcha_solution = await solve_captcha(response)
        # Option 3: Switch to a different proxy/user agent
        # yield retry_with_different_identity(response.request)
8. Implement Backoff on Detection
from qcrawl.core.spider import Spider

class AdaptiveSpider(Spider):
    name = "adaptive_spider"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.detection_count = 0
        self.base_delay = 2.0

    async def parse(self, response):
        if self.is_detected(response):
            self.detection_count += 1
            new_delay = self.base_delay * (2 ** self.detection_count)
            self.logger.warning(f"Detection! Increasing delay to {new_delay}s")
            self.custom_settings["DELAY_PER_DOMAIN"] = new_delay
        elif self.detection_count > 0:
            # Gradually relax the delay after successful responses
            self.detection_count -= 1

    def is_detected(self, response):
        return response.status_code in (429, 403) or "captcha" in response.text.lower()
Common Anti-Bot Patterns and Solutions
Pattern 1: Rate Limiting (429 Response)
Detection: Server returns 429 Too Many Requests
Solution:
custom_settings = {
    "DELAY_PER_DOMAIN": 5.0,
    "CONCURRENCY_PER_DOMAIN": 1,
    "RETRY_HTTP_CODES": [429],
    "RETRY_TIMES": 5,
}
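Many servers also send a Retry-After header with 429 responses. Honoring it yourself looks roughly like the sketch below; whether RetryMiddleware consults this header is not specified here, so treat this as an assumption:

import asyncio

async def parse(self, response):
    if response.status_code == 429:
        # Retry-After is either seconds or an HTTP date; handle the common case
        retry_after = response.headers.get("Retry-After", "")
        if retry_after.isdigit():
            await asyncio.sleep(int(retry_after))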
Pattern 2: IP-Based Blocking
Detection: 403 Forbidden or connection refused
Solution:
# Use proxy rotation
custom_settings = {
    "DOWNLOADER_MIDDLEWARES": {
        "qcrawl.middleware.downloader.httpproxy.HttpProxyMiddleware": 750,
    },
}

# Rotate proxies per request
yield Request(url=url, meta={"proxy": random.choice(proxy_pool)})
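A round-robin alternative to random selection, which spreads load evenly across the pool (standard-library itertools):

from itertools import cycle

proxy_pool = cycle([
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
])

# Each request takes the next proxy in order
yield Request(url=url, meta={"proxy": next(proxy_pool)})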
Pattern 3: JavaScript Challenge
Detection: Page requires JavaScript execution
Solution:
# Use Camoufox for JavaScript rendering
yield Request(
    url=url,
    meta={
        "use_handler": "camoufox",
        "camoufox_page_methods": [
            PageMethod("wait_for_selector", ".content"),
        ],
    },
)
Pattern 4: User Agent Filtering
Detection: Different content for different user agents
Solution:
# Use realistic, complete user agent strings
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]

yield Request(url=url, headers={"User-Agent": random.choice(user_agents)})
Pattern 5: Session/Cookie Requirements
Detection: Access denied without valid session
Solution:
# Let CookiesMiddleware handle sessions automatically
custom_settings = {
    "DOWNLOADER_MIDDLEWARES": {
        "qcrawl.middleware.downloader.cookies.CookiesMiddleware": 700,
    },
}

# Or handle login manually
async def start_requests(self):
    # Log in first
    yield Request(
        url="https://example.com/login",
        method="POST",
        body={"username": "...", "password": "..."},
        callback=self.after_login,
    )

async def after_login(self, response):
    # Cookies are stored automatically; proceed with scraping
    yield Request(url="https://example.com/data")
Monitoring and Debugging
Enable Detailed Logging
custom_settings = {
    "LOG_LEVEL": "DEBUG",
}

# In the spider
self.logger.info(f"Successfully scraped {response.url}")
self.logger.warning(f"Retrying {response.url} - attempt {retry_count}")
Track Success Rates
from qcrawl.core.spider import Spider

class MonitoredSpider(Spider):
    name = "monitored_spider"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.success_count = 0
        self.failure_count = 0

    async def parse(self, response):
        if response.status_code == 200:
            self.success_count += 1
        else:
            self.failure_count += 1

        # Log metrics
        total = self.success_count + self.failure_count
        success_rate = self.success_count / total if total > 0 else 0
        self.logger.info(f"Success rate: {success_rate:.2%}")
Summary
qCrawl provides comprehensive anti-bot evasion capabilities:
| Feature | Purpose | Key Settings |
|---|---|---|
| Rate Limiting | Mimic human speed | DELAY_PER_DOMAIN, CONCURRENCY_PER_DOMAIN |
| User Agent Rotation | Avoid fingerprinting | Per-request headers |
| Cookie Management | Session handling | CookiesMiddleware (automatic) |
| Proxy Support | IP rotation | HttpProxyMiddleware, per-request meta |
| Retry Strategy | Handle transient errors | RETRY_TIMES, RETRY_HTTP_CODES |
| Custom Headers | Appear legitimate | DEFAULT_REQUEST_HEADERS, per-request |
| Navigation Patterns | Human-like behavior | PageMethod with Camoufox |