Introduction

qCrawl is a fast async web crawling framework for Python that makes it easy to extract structured data from websites.

What qCrawl provides:

  1. Async architecture - High-performance concurrent crawling based on asyncio
  2. Powerful parsing - Built-in CSS/XPath selectors with lxml
  3. Middleware system - Customizable request/response processing with downloader and spider middlewares
  4. Flexible export - Multiple output formats including JSON, CSV, XML
  5. Flexible queue backends - Memory, disk, or Redis-based schedulers for different scale requirements
  6. Item pipelines - Data transformation, validation, and processing before export

Your first spider

Let's start with the simplest possible spider and build from there.

quotes_spider.py
from qcrawl.core.spider import Spider

class QuotesSpider(Spider):
    name = "quotes_spider"
    start_urls = ["https://quotes.toscrape.com/"]

    async def parse(self, response):

        # ResponseView provides convenient parsing methods:
        # CSS/XPath selectors, link extraction, and URL resolution
        rv = self.response_view(response)

        # rv.doc exposes a lazy-loaded lxml document tree for efficient parsing
        for quote in rv.doc.cssselect('div.quote'):
            yield {
                "text": quote.cssselect('span.text')[0].text_content(),
                "author": quote.cssselect('small.author')[0].text_content(),
            }

Run it:

# Default - prints to console (stdout) in NDJSON format
qcrawl quotes_spider.py:QuotesSpider

The spider visits the URL, extracts quotes, and outputs them to stdout in NDJSON format (one JSON object per line).
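
Each item becomes one line of JSON, so the output looks roughly like this (quote text abridged for illustration):

{"text": "“The world as we have created it is a process of our thinking. …”", "author": "Albert Einstein"}
{"text": "“It is our choices, Harry, that show what we truly are, …”", "author": "J.K. Rowling"}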

Understanding the response

Let's explore what qCrawl provides in the response object:

async def parse(self, response):
    # Access response properties
    print(f"URL: {response.url}")
    print(f"Status: {response.status_code}")
    print(f"Headers: {response.headers}")

    # Get response view for parsing
    rv = self.response_view(response)

    # Access the parsed HTML document
    print(f"Title: {rv.doc.cssselect('title')[0].text_content()}")

    # Extract all links
    links = [a.get('href') for a in rv.doc.cssselect('a')]
    print(f"Found {len(links)} links")

Key response properties:

  • response.url - Final URL (after redirects)
  • response.status_code - HTTP status (200, 404, etc.)
  • response.headers - Response headers dict
  • response.text - Raw HTML content
  • response.json() - Parse JSON responses
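
response.json() comes in handy when a crawl hits a JSON endpoint rather than an HTML page. A minimal sketch, assuming response.json() is a plain (non-awaited) call as the list above suggests, and that the endpoint returns an object with a "results" list (the field names are purely illustrative):

async def parse(self, response):
    if response.status_code != 200:
        self.logger.warning(f"Non-200 status: {response.status_code}")
        return

    # Decode the JSON body instead of building an lxml tree
    data = response.json()

    for record in data.get("results", []):
        yield {
            "id": record.get("id"),
            "name": record.get("name"),
            "url": response.url,
        }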

ResponseView (rv) provides:

  • rv.doc - Lazy-loaded lxml document tree for CSS/XPath selectors
  • rv.follow(href, priority=0, meta=None) - Create new Request from relative or absolute URL
  • rv.urljoin(url) - Resolve relative URL to absolute URL (returns string)
  • rv.response - Access to the underlying Response object
  • rv.spider - Access to the current Spider instance
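
The earlier snippet only touched rv.doc; the URL helpers are worth a quick look too. A short sketch using rv.urljoin to store an absolute link on an item and rv.follow to schedule the same link for crawling (following every link on a page is just for illustration):

async def parse(self, response):
    rv = self.response_view(response)

    for link in rv.doc.cssselect('a'):
        href = link.get('href')
        if not href:
            continue

        # rv.urljoin() only resolves the URL; rv.follow() builds a new Request
        yield {
            "link_text": link.text_content().strip(),
            "link_url": rv.urljoin(href),
        }
        yield rv.follow(href, meta={"referer": response.url})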

Adding configuration

Customize spider behavior with settings:

class QuotesSpider(Spider):
    name = "quotes_spider"
    start_urls = ["https://quotes.toscrape.com/"]

    # Spider-specific settings
    custom_settings = {
        "CONCURRENCY": 5,
        "DELAY_PER_DOMAIN": 0.5,
        "USER_AGENT": "MyBot/1.0",
        "TIMEOUT": 30.0,
    }

    async def parse(self, response):
        rv = self.response_view(response)

        for quote in rv.doc.cssselect('div.quote'):
            text_elem = quote.cssselect('span.text')
            author_elem = quote.cssselect('small.author')

            # Skip if elements are missing
            if not text_elem or not author_elem:
                continue

            yield {
                "text": text_elem[0].text_content().strip(),
                "author": author_elem[0].text_content().strip(),
                "url": response.url,
            }

Common settings:

  • CONCURRENCY - Number of concurrent requests
  • DELAY_PER_DOMAIN - Delay between requests to same domain (seconds)
  • USER_AGENT - User-Agent header string sent with requests
  • TIMEOUT - Request timeout in seconds
  • MAX_RETRIES - Maximum number of retry attempts for failed requests
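
As an annotated variant of the snippet above, all of these fit in one custom_settings dict (the values are illustrative; MAX_RETRIES is assumed to take an integer count):

custom_settings = {
    "CONCURRENCY": 10,        # up to 10 requests in flight at once
    "DELAY_PER_DOMAIN": 1.0,  # seconds between requests to the same domain
    "USER_AGENT": "MyBot/1.0",
    "TIMEOUT": 15.0,          # per-request timeout in seconds
    "MAX_RETRIES": 3,         # assumed: retry attempts for failed requests
}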

For a full list of settings, see the settings reference.

Handling errors

Add error handling for robust spiders:

from cssselect import SelectorError

class QuotesSpider(Spider):
    name = "quotes_spider"
    start_urls = ["https://quotes.toscrape.com/"]

    async def parse(self, response):
        # Check response status
        if response.status_code != 200:
            self.logger.warning(f"Non-200 status: {response.status_code}")
            return

        rv = self.response_view(response)

        # Safe selector usage
        try:
            quotes = rv.doc.cssselect('div.quote')
        except SelectorError as e:
            self.logger.error(f"Invalid selector: {e}")
            return

        if not quotes:
            self.logger.info("No quotes found on page")
            return

        for quote in quotes:
            try:
                text = quote.cssselect('span.text')[0].text_content().strip()
                author = quote.cssselect('small.author')[0].text_content().strip()

                yield {
                    "text": text,
                    "author": author,
                    "url": response.url,
                }
            except (IndexError, AttributeError) as e:
                # Skip malformed quotes
                self.logger.debug(f"Skipping quote due to: {e}")
                continue

Error handling patterns:

  • Check response.status_code before processing
  • Use try/except for selector operations
  • Validate elements exist before accessing
  • Log warnings/errors for debugging
  • Use continue to skip bad items, return to skip page
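
One way to cut down the per-field try/except noise is a small plain-Python helper (not part of qCrawl, just ordinary lxml/cssselect usage):

from qcrawl.core.spider import Spider

def first_text(element, selector, default=""):
    """Return the stripped text of the first match for selector, or default."""
    matches = element.cssselect(selector)
    return matches[0].text_content().strip() if matches else default

class QuotesSpider(Spider):
    name = "quotes_spider"
    start_urls = ["https://quotes.toscrape.com/"]

    async def parse(self, response):
        rv = self.response_view(response)
        for quote in rv.doc.cssselect('div.quote'):
            text = first_text(quote, 'span.text')
            author = first_text(quote, 'small.author')
            # Missing fields come back as "" and are skipped instead of raising
            if text and author:
                yield {"text": text, "author": author, "url": response.url}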

For advanced error handling, see error recovery.

Following links

Crawl multiple pages by following links:

class QuotesSpider(Spider):
    name = "quotes_spider"
    start_urls = ["https://quotes.toscrape.com/"]

    async def parse(self, response):
        rv = self.response_view(response)

        # Extract items from current page
        for quote in rv.doc.cssselect('div.quote'):
            text_elem = quote.cssselect('span.text')
            author_elem = quote.cssselect('small.author')

            if text_elem and author_elem:
                yield {
                    "text": text_elem[0].text_content().strip(),
                    "author": author_elem[0].text_content().strip(),
                    "url": response.url,
                }

        # Follow "next" link for pagination
        next_links = rv.doc.cssselect('li.next a')
        if next_links:
            href = next_links[0].get('href')
            if href:
                # rv.follow() resolves relative URLs and preserves context
                yield rv.follow(href)

Link following patterns:

  • rv.follow(href) - Follow relative or absolute URL
  • self.follow(response, href) - Alternative syntax
  • Yields new requests back to the scheduler
  • Same parse() method handles all pages

For advanced pagination techniques, see pagination strategies.
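
Because every response comes back through the same parse() method, a common pattern is to follow several kinds of links and branch on response.url. A sketch against the quotes demo site, assuming its author-page markup; the selectors here are illustrative and not part of qCrawl:

async def parse(self, response):
    rv = self.response_view(response)

    # Author detail pages are recognized by their URL path
    if "/author/" in response.url:
        name = rv.doc.cssselect('h3.author-title')
        born = rv.doc.cssselect('span.author-born-date')
        if name and born:
            yield {
                "author": name[0].text_content().strip(),
                "born": born[0].text_content().strip(),
            }
        return

    # Listing pages: follow each "(about)" author link plus the next page
    for link in rv.doc.cssselect('small.author + a'):
        href = link.get('href')
        if href:
            yield rv.follow(href)

    next_links = rv.doc.cssselect('li.next a')
    if next_links and next_links[0].get('href'):
        yield rv.follow(next_links[0].get('href'))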

Exporting data

Save scraped data to files:

# Export to JSON
qcrawl --export quotes.json --export-format json quotes_spider.py:QuotesSpider

# Export to CSV
qcrawl --export quotes.csv --export-format csv quotes_spider.py:QuotesSpider

# Export to XML
qcrawl --export quotes.xml --export-format xml quotes_spider.py:QuotesSpider

Export formats:

  • json - JSON array of items
  • ndjson - One JSON object per line (NDJSON format)
  • csv - CSV with headers
  • xml - XML format
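
ndjson is listed above but not shown in the commands; assuming the same flags accept it, writing NDJSON to a file would presumably look like:

# Export to NDJSON (one JSON object per line)
qcrawl --export quotes.ndjson --export-format ndjson quotes_spider.py:QuotesSpider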

Complete example

Putting it all together:

from cssselect import SelectorError
from qcrawl.core.spider import Spider
from qcrawl.core.item import Item

class QuotesSpider(Spider):
    name = "quotes_spider"
    start_urls = ["https://quotes.toscrape.com/"]

    custom_settings = {
        "CONCURRENCY": 5,
        "DELAY_PER_DOMAIN": 0.5,
        "USER_AGENT": "QuoteBot/1.0",
        "MAX_DEPTH": 0,  # Unlimited depth
    }

    async def parse(self, response):
        """Extract quotes and follow pagination."""

        # Validate response
        if response.status_code != 200:
            self.logger.warning(f"Non-200 status: {response.status_code}")
            return

        rv = self.response_view(response)

        # Safe selector usage
        try:
            quotes = rv.doc.cssselect('div.quote')
        except SelectorError as e:
            self.logger.error(f"Selector error: {e}")
            return

        # Extract quotes
        for quote in quotes:
            try:
                text = quote.cssselect('span.text')[0].text_content().strip()
                author = quote.cssselect('small.author')[0].text_content().strip()
                tags = [t.text_content().strip()
                       for t in quote.cssselect('div.tags a.tag')]

                yield Item(
                    data={
                        "text": text,
                        "author": author,
                        "tags": tags,
                        "url": response.url,
                    },
                    metadata={"source": "quotes.toscrape.com"}
                )
            except (IndexError, AttributeError) as e:
                self.logger.debug(f"Skipping malformed quote: {e}")
                continue

        # Follow pagination
        next_links = rv.doc.cssselect('li.next a')
        if next_links:
            href = next_links[0].get('href')
            if href:
                yield rv.follow(href)

Run with export:

qcrawl --export quotes.json --export-format json quotes_spider.py:QuotesSpider

What’s next?

Ready to get started? Install qCrawl, review examples, and join the community on Discord for support.
Thanks for your interest!