Spiders
A Spider is a class that defines what to crawl (a site or a group of sites), how to perform the crawl (i.e. how to follow links), and how to extract structured data from the pages (i.e. how to scrape items).
Basic concepts
qCrawl spiders should subclass `qcrawl.core.spider.Spider`, override the async `parse()` method, and define the `name` and `start_urls` attributes.
Class attributes:
- `name: str` — unique spider identifier (required).
- `start_urls: list[str]` — initial URLs to crawl (required).
- `custom_settings: dict` — spider-specific settings that override global project settings.
- `allowed_domains: list[str]` — restrict crawling to these domains.
`parse(response)` — async generator that can yield:

- `Item` — container for scraped fields and internal metadata, or a plain `dict` (the engine wraps it into an `Item`).
- `Request` — data class representing an HTTP crawl request.
- `str` URL — the engine or middlewares convert it to a `Request`.
The `parse()` method processes every downloaded page. It receives a `Page` object representing the HTTP response and yields `Item` objects, `Request` objects, or string URLs, as sketched below.
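The following minimal sketch ties the optional class attributes and the yield types together. The `Request` import path and its `url` keyword argument, as well as the `DOWNLOAD_DELAY` setting key, are illustrative assumptions rather than documented qCrawl API:

```python
from qcrawl.core.spider import Spider
from qcrawl.core.item import Item
# Assumed import path and constructor for Request; adjust to the real API.
from qcrawl.core.request import Request


class ExampleSpider(Spider):
    name = "example"                       # required: unique spider identifier
    start_urls = ["https://example.com/"]  # required: initial URLs to crawl
    allowed_domains = ["example.com"]      # optional: stay within this domain
    custom_settings = {                    # optional: override global settings
        "DOWNLOAD_DELAY": 0.5,             # illustrative setting key, not a documented name
    }

    async def parse(self, response):
        # A plain dict: the engine wraps it into an Item.
        yield {"source": self.name}

        # An explicit Item with data and metadata.
        yield Item(data={"source": self.name}, metadata={})

        # A Request object for another URL to crawl (constructor is assumed).
        yield Request(url="https://example.com/page/2/")

        # A bare string URL: the engine or middlewares convert it to a Request.
        yield "https://example.com/page/3/"
```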
A more complete example spider that uses CSS selectors and yields `Item` objects:
from datetime import datetime, timezone

from qcrawl.core.spider import Spider, ResponseView
from qcrawl.core.response import Page
from qcrawl.core.item import Item


class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    async def parse(self, response: Page):
        # ResponseView provides helper methods for parsing:
        # CSS/XPath selectors, link extraction, URL resolution.
        rv: ResponseView = self.response_view(response)

        # `response_view(response).doc` exposes a lazy-loaded `lxml` document tree.
        # For the selectors guide, see: https://www.qcrawl.org/concepts/selectors/
        for q in rv.doc.cssselect(".quote"):
            text_nodes = q.cssselect("span.text")
            author_nodes = q.cssselect("small.author")
            if not text_nodes or not author_nodes:
                continue
            text = text_nodes[0].text_content().strip()
            author = author_nodes[0].text_content().strip()
            ts = datetime.now(timezone.utc).isoformat()
            yield Item(
                data={"text": text, "author": author},
                metadata={"scraped_at": ts},
            )

        next_link = rv.doc.cssselect("li.next a")
        if next_link:
            href = next_link[0].get("href")
            if href:
                yield rv.follow(response, href)
Scraping lifecycle
flowchart LR
Spider -->|"yield Request / URL / Item"| Scheduler
Scheduler -->|"next Request"| Engine
Engine -->|"fetch"| Downloader
Downloader -->|"Response"| Engine
Engine -->|"call parse(response)"| Spider
Spider -.->|"yield Item"| Export@{ shape: bow-rect, label: "Export process" }
The simplified scraping cycle works as follows:
- You generate the initial requests to crawl the first URLs. These requests come from the `start_requests()` method of your spider, which by default yields a `Request` for each URL in `start_urls` (a sketch of a custom `start_requests()` follows this list).
- Each request is placed in the scheduler's queue. The engine pulls the next request, sends it to the downloader, and waits for the response. Once downloaded, the response is passed back to the engine.
- The engine calls your spider's `parse(response)` method with the downloaded response. Inside this method, you parse the page content (using CSS or XPath selectors) and yield either `Item` objects containing extracted data or new `Request` objects for additional URLs to crawl.
- Any yielded `Request`/URL/`Item` objects are returned to the scheduler, enqueued, and processed in the same way, forming a continuous loop until no more requests remain.
- Any yielded `Item` objects are sent to the export process: item pipelines (drop, transform), exporters (data formatting), and storage backends (save data).
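To customize the initial requests (for example, to generate page URLs programmatically), you can override `start_requests()`. The sketch below assumes `start_requests()` is an async generator on `Spider` and that `Request` is importable from `qcrawl.core.request` with a `url` keyword argument; these details are illustrative assumptions rather than documented API:

```python
from qcrawl.core.spider import Spider
from qcrawl.core.item import Item
# Assumed import path and constructor for Request; adjust to the real API.
from qcrawl.core.request import Request


class CustomStartSpider(Spider):
    name = "custom-start"
    start_urls = ["https://quotes.toscrape.com/"]

    async def start_requests(self):
        # The default implementation yields one Request per entry in start_urls;
        # here we generate page URLs directly instead.
        for page in range(1, 4):
            yield Request(url=f"https://quotes.toscrape.com/page/{page}/")

    async def parse(self, response):
        # Each downloaded page flows back through parse() as usual.
        yield Item(data={"spider": self.name}, metadata={})
```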