Middleware

Middlewares are hooks that sit between core components, allowing you to modify request/response processing, filter data, handle errors, and extend qCrawl's behavior without touching the core codebase.

Overview

qCrawl has two types of middlewares:

DownloaderMiddleware - Hooks around HTTP download:

  • Modify requests before they're sent (add headers, authentication)
  • Process responses after download (filter, retry, redirect)
  • Handle exceptions during download (network errors, timeouts)

SpiderMiddleware - Hooks around spider processing:

  • Filter initial requests from start_requests()
  • Process responses before spider.parse()
  • Filter Items and Requests yielded by spider
  • Handle exceptions during parsing

Both types execute in chains with configurable priority ordering.

Downloader Middleware

Downloader middlewares wrap the HTTP download process.

Hooks

process_request(request, spider)

Called before the request is sent to the downloader.

Use cases:

  • Add authentication headers
  • Set cookies
  • Add custom headers
  • Modify request parameters
  • Log outgoing requests

Returns: MiddlewareResult enum

  • CONTINUE - Pass request to next middleware
  • KEEP - Stop chain, send this request to downloader
  • DROP - Drop request entirely (don't download)

Example:

from qcrawl.middleware.base import DownloaderMiddleware, MiddlewareResult

class AuthMiddleware(DownloaderMiddleware):
    async def process_request(self, request, spider):
        # Add authentication token
        request.headers["Authorization"] = f"Bearer {spider.api_token}"
        return MiddlewareResult.CONTINUE
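
The same hook can also filter traffic before it ever reaches the downloader. A minimal sketch that drops requests to domains on a blocklist, where blocked_domains is a hypothetical attribute defined on your spider:

class BlocklistMiddleware(DownloaderMiddleware):
    async def process_request(self, request, spider):
        # blocked_domains is a hypothetical spider attribute
        blocked = getattr(spider, "blocked_domains", ())
        if any(domain in request.url for domain in blocked):
            spider.logger.debug(f"Dropping blocked URL: {request.url}")
            return MiddlewareResult.DROP

        return MiddlewareResult.CONTINUE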

process_response(request, response, spider)

Called after the downloader returns a response.

Use cases:

  • Filter responses by status code
  • Detect and handle errors
  • Transform response content
  • Log responses
  • Trigger retries

Returns: MiddlewareResult enum

  • CONTINUE - Pass response to next middleware
  • KEEP - Stop chain, send this response to spider
  • RETRY - Retry the request
  • DROP - Drop response (don't send to spider)

Example:

class StatusCodeMiddleware(DownloaderMiddleware):
    async def process_response(self, request, response, spider):
        # Drop 404 responses
        if response.status_code == 404:
            spider.logger.warning(f"Page not found: {response.url}")
            return MiddlewareResult.DROP

        return MiddlewareResult.CONTINUE
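
Because process_response can also return RETRY, it is a natural place to retry transient server errors. A minimal sketch, assuming retry bookkeeping (attempt limits, backoff) is handled elsewhere in the framework:

class ServerErrorRetryMiddleware(DownloaderMiddleware):
    async def process_response(self, request, response, spider):
        # Retry transient server-side failures
        if response.status_code in (500, 502, 503, 504):
            spider.logger.info(f"Retrying {request.url} (HTTP {response.status_code})")
            return MiddlewareResult.RETRY

        return MiddlewareResult.CONTINUE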

process_exception(request, exception, spider)

Called when an exception occurs during download.

Use cases:

  • Handle network errors
  • Log exceptions
  • Retry failed requests
  • Return fallback responses

Returns: MiddlewareResult enum

  • CONTINUE - Pass to next middleware
  • RETRY - Retry the request
  • DROP - Drop request

Example:

import aiohttp
from qcrawl.middleware.base import DownloaderMiddleware, MiddlewareResult

class NetworkErrorMiddleware(DownloaderMiddleware):
    async def process_exception(self, request, exception, spider):
        if isinstance(exception, aiohttp.ClientError):
            spider.logger.error(f"Network error for {request.url}: {exception}")
            return MiddlewareResult.RETRY

        return MiddlewareResult.CONTINUE

Execution order

Downloader middlewares execute in two phases:

Request phase (before download):

MW1.process_request → MW2.process_request → MW3.process_request → Downloader

Response phase (after download, reversed order):

MW1.process_response ← MW2.process_response ← MW3.process_response ← Downloader

Lower priority number = executed first in request phase, last in response phase.
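
To make the ordering concrete, here is a sketch of two trivial logging middlewares. If they were registered with priorities 100 and 200 respectively, a request passes through A then B, and the response comes back through B then A:

class LoggingMiddlewareA(DownloaderMiddleware):  # e.g. priority 100
    async def process_request(self, request, spider):
        spider.logger.debug("A: request")   # first in the request phase
        return MiddlewareResult.CONTINUE

    async def process_response(self, request, response, spider):
        spider.logger.debug("A: response")  # last in the response phase
        return MiddlewareResult.CONTINUE

class LoggingMiddlewareB(DownloaderMiddleware):  # e.g. priority 200
    async def process_request(self, request, spider):
        spider.logger.debug("B: request")   # second in the request phase
        return MiddlewareResult.CONTINUE

    async def process_response(self, request, response, spider):
        spider.logger.debug("B: response")  # first in the response phase
        return MiddlewareResult.CONTINUE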

Spider Middleware

Spider middlewares wrap spider processing.

Hooks

process_start_requests(start_requests, spider)

Called with initial requests from spider.start_requests().

Use cases:

  • Filter initial URLs
  • Add metadata to start requests
  • Transform URLs
  • Limit initial requests

Parameters:

  • start_requests - Async generator of initial Requests

Returns: Async generator of Requests

Example:

from qcrawl.middleware.base import SpiderMiddleware

class StartRequestsFilterMiddleware(SpiderMiddleware):
    async def process_start_requests(self, start_requests, spider):
        async for request in start_requests:
            # Only crawl .com domains
            if ".com" in request.url:
                yield request
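
The same hook can cap how many start requests enter the crawl at all (the "limit initial requests" use case above). A minimal sketch, where max_start_requests is a hypothetical spider attribute:

class StartRequestsLimitMiddleware(SpiderMiddleware):
    async def process_start_requests(self, start_requests, spider):
        # max_start_requests is a hypothetical spider attribute
        limit = getattr(spider, "max_start_requests", 100)
        count = 0
        async for request in start_requests:
            if count >= limit:
                break
            count += 1
            yield request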

process_spider_input(response, spider)

Called before spider.parse() receives the response.

Use cases:

  • Validate response before parsing
  • Add metadata to response
  • Filter responses
  • Log incoming responses

Returns: MiddlewareResult enum

  • CONTINUE - Pass response to next middleware
  • DROP - Drop response (don't parse)

Example:

from qcrawl.middleware.base import MiddlewareResult, SpiderMiddleware

class ResponseValidationMiddleware(SpiderMiddleware):
    async def process_spider_input(self, response, spider):
        # Drop empty responses
        if not response.text:
            spider.logger.warning(f"Empty response from {response.url}")
            return MiddlewareResult.DROP

        return MiddlewareResult.CONTINUE

process_spider_output(response, result, spider)

Called with each Item or Request yielded by spider.parse().

Use cases:

  • Filter Items or Requests
  • Transform yielded data
  • Add metadata to Items
  • Enforce depth limits
  • Log scraped items

Parameters:

  • response - The response being parsed
  • result - Individual Item or Request yielded by spider

Returns: Item | Request | None

  • Return the result to pass it along
  • Return None to drop it

Example:

class ItemFilterMiddleware(SpiderMiddleware):
    async def process_spider_output(self, response, result, spider):
        from qcrawl.core.item import Item

        # Filter out items without title
        if isinstance(result, Item):
            if not result.data.get("title"):
                spider.logger.debug("Dropping item without title")
                return None

        return result
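
The same hook can enrich results instead of filtering them (the "add metadata to Items" use case above). A sketch that stamps each Item with the URL it was scraped from, assuming item data behaves like a mutable dict, as the .get() call above suggests:

class SourceUrlMiddleware(SpiderMiddleware):
    async def process_spider_output(self, response, result, spider):
        from qcrawl.core.item import Item

        # Record where each item came from (assumes result.data is a mutable dict)
        if isinstance(result, Item):
            result.data.setdefault("source_url", response.url)

        return result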

process_spider_exception(response, exception, spider)

Called when spider.parse() raises an exception.

Use cases:

  • Log parsing errors
  • Handle specific exceptions
  • Return fallback Items/Requests
  • Skip problematic pages

Returns: List of Items/Requests (may be empty)

Example:

class ParsingErrorMiddleware(SpiderMiddleware):
    async def process_spider_exception(self, response, exception, spider):
        spider.logger.error(
            f"Error parsing {response.url}: {exception}",
            exc_info=True
        )
        # Return empty list to skip this page
        return []

Execution order

Spider middlewares execute in two phases:

Input phase (before parse):

MW1.process_spider_input → MW2.process_spider_input → MW3.process_spider_input → Spider.parse()

Output phase (after parse, reversed order):

MW1.process_spider_output ← MW2.process_spider_output ← MW3.process_spider_output ← Spider.parse()

Middleware Results

Middlewares return a MiddlewareResult enum value to control execution flow:

from qcrawl.middleware.base import MiddlewareResult

class MyMiddleware(DownloaderMiddleware):
    async def process_request(self, request, spider):
        # Return exactly one of the following:
        return MiddlewareResult.CONTINUE   # continue to next middleware
        # return MiddlewareResult.KEEP     # stop chain, use current value
        # return MiddlewareResult.RETRY    # retry the request
        # return MiddlewareResult.DROP     # drop request/response

When to use each:

  • CONTINUE - Default, pass to next middleware
  • KEEP - Stop chain early (optimization)
  • RETRY - Request failed, retry with backoff
  • DROP - Filter out unwanted requests/responses

Registering Middlewares

In spider settings

from qcrawl.core.spider import Spider

class MySpider(Spider):
    name = "my_spider"
    start_urls = ["https://example.com"]

    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            "myproject.middlewares.AuthMiddleware": 100,
            "myproject.middlewares.StatusCodeMiddleware": 200,
        },
        "SPIDER_MIDDLEWARES": {
            "myproject.middlewares.ItemFilterMiddleware": 300,
        }
    }

In global settings

# settings.py or pyproject.toml
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.AuthMiddleware": 100,
    "myproject.middlewares.CustomMiddleware": 500,
}

SPIDER_MIDDLEWARES = {
    "myproject.middlewares.DepthMiddleware": 100,
}

Priority numbers

  • Lower number = executed first in request/input phase
  • Higher number = executed first in response/output phase
  • Built-in middlewares use 100, 200, 300, etc.
  • Leave gaps for custom middlewares

Example ordering:

Priority 100: MW1
Priority 200: MW2
Priority 500: MW3

Request flow:  MW1 → MW2 → MW3 → Downloader
Response flow: MW1 ← MW2 ← MW3 ← Downloader

Middleware with State

Initialization with from_crawler

Middlewares can access crawler components via from_crawler():

class StatefulMiddleware(DownloaderMiddleware):
    def __init__(self, settings, stats):
        self.max_retries = settings.get("MAX_RETRIES", 3)
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            settings=crawler.settings,
            stats=crawler.stats
        )

    async def process_request(self, request, spider):
        self.stats.inc_value("custom/requests")
        return MiddlewareResult.CONTINUE

Spider-specific state

Store runtime state on the middleware instance, for example per-domain request timing:

import asyncio
import time
from urllib.parse import urlparse

class RateLimitMiddleware(DownloaderMiddleware):
    def __init__(self):
        self.request_times = {}  # domain → last request time

    async def process_request(self, request, spider):
        domain = urlparse(request.url).netloc
        last_time = self.request_times.get(domain, 0)
        current_time = time.time()

        # Wait if the last request to this domain was too recent
        delay = getattr(spider, "download_delay", 1.0)
        if current_time - last_time < delay:
            await asyncio.sleep(delay - (current_time - last_time))

        self.request_times[domain] = time.time()
        return MiddlewareResult.CONTINUE

Best Practices

Design

  • Single responsibility: Each middleware should do one thing well
  • Fail gracefully: Handle errors without breaking the crawl
  • Be defensive: Validate inputs and handle edge cases
  • Document behavior: Explain what your middleware does and when to use it

Performance

  • Minimize overhead: Middlewares execute on every request/response
  • Use async operations: Don't block the event loop
  • Cache when possible: Avoid repeated computations (see the sketch after this list)
  • Profile impact: Measure middleware overhead
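
For the caching point above, a minimal sketch that memoizes a pure helper with functools.lru_cache so repeated work is not redone on every request (the X-Target-Domain header is purely illustrative):

from functools import lru_cache
from urllib.parse import urlparse

from qcrawl.middleware.base import DownloaderMiddleware, MiddlewareResult

@lru_cache(maxsize=1024)
def domain_of(url: str) -> str:
    # Pure function, safe to cache; the same pattern applies to more expensive work
    return urlparse(url).netloc

class DomainTagMiddleware(DownloaderMiddleware):
    async def process_request(self, request, spider):
        request.headers["X-Target-Domain"] = domain_of(request.url)  # illustrative header
        return MiddlewareResult.CONTINUE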

State management

  • Use from_crawler(): Access settings, stats, signals
  • Thread-safe state: Multiple workers access middlewares concurrently (see the sketch below)
  • Clean up resources: Implement cleanup if needed
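
For the concurrency point above, a minimal sketch that guards shared mutable state with an asyncio.Lock (the counter itself is purely illustrative):

import asyncio
from qcrawl.middleware.base import DownloaderMiddleware, MiddlewareResult

class CountingMiddleware(DownloaderMiddleware):
    def __init__(self):
        self._lock = asyncio.Lock()
        self.seen = 0  # shared mutable state

    async def process_request(self, request, spider):
        # Serialize updates so concurrent workers don't race on the counter
        async with self._lock:
            self.seen += 1
        return MiddlewareResult.CONTINUE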

Configuration

  • Use settings: Make behavior configurable
  • Provide defaults: Sensible defaults for all options
  • Document settings: Explain all configuration options
  • Validate settings: Check required settings exist (example below)
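
For the validation point above, a minimal sketch that fails fast in from_crawler() when a required setting is missing (API_TOKEN is a hypothetical setting name):

class TokenAuthMiddleware(DownloaderMiddleware):
    def __init__(self, token):
        self.token = token

    @classmethod
    def from_crawler(cls, crawler):
        # Fail fast if the required setting is missing (API_TOKEN is hypothetical)
        token = crawler.settings.get("API_TOKEN")
        if not token:
            raise ValueError("TokenAuthMiddleware requires the API_TOKEN setting")
        return cls(token)

    async def process_request(self, request, spider):
        request.headers["Authorization"] = f"Bearer {self.token}"
        return MiddlewareResult.CONTINUE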

Testing

  • Unit test logic: Test middleware behavior in isolation (see the test sketch after this list)
  • Mock dependencies: Don't require real HTTP requests
  • Test edge cases: Handle errors, empty responses, etc.
  • Integration test: Verify middleware works in full crawl
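
Middleware hooks are plain async methods, so they can be unit tested without a running crawl. A sketch that tests the AuthMiddleware defined earlier, using simple stand-in objects instead of real Request and Spider instances:

import asyncio
from types import SimpleNamespace

from qcrawl.middleware.base import MiddlewareResult

def test_auth_header_is_added():
    middleware = AuthMiddleware()  # the example middleware defined earlier
    request = SimpleNamespace(url="https://example.com", headers={})
    spider = SimpleNamespace(api_token="secret")

    result = asyncio.run(middleware.process_request(request, spider))

    assert request.headers["Authorization"] == "Bearer secret"
    assert result is MiddlewareResult.CONTINUE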

See also