Introduction
qCrawl is a fast async web crawling framework for Python that makes it easy to extract structured data from websites.
What qCrawl provides:
- Async architecture - High-performance concurrent crawling based on asyncio
- Powerful parsing - Built-in CSS/XPath selectors with lxml
- Middleware system - Customizable request/response processing with downloader and spider middlewares
- Flexible export - Multiple output formats including JSON, CSV, XML
- Flexible queue backends - Memory, disk, or Redis-based schedulers for different scale requirements
- Item pipelines - Data transformation, validation, and processing before export
Your first spider
Let's start with the simplest possible spider and build from there.
```python
from qcrawl.core.spider import Spider


class QuotesSpider(Spider):
    name = "quotes_spider"
    start_urls = ["https://quotes.toscrape.com/"]

    async def parse(self, response):
        # ResponseView provides convenient parsing methods:
        # CSS/XPath selectors, link extraction, and URL resolution
        rv = self.response_view(response)

        # rv.doc exposes a lazy-loaded lxml document tree for efficient parsing
        for quote in rv.doc.cssselect('div.quote'):
            yield {
                "text": quote.cssselect('span.text')[0].text_content(),
                "author": quote.cssselect('small.author')[0].text_content(),
            }
```
Run it:
```bash
# Default - prints to console (stdout) in NDJSON format
qcrawl quotes_spider.py:QuotesSpider
```
Understanding the response
Let's explore what qCrawl provides in the response object:
```python
async def parse(self, response):
    # Access response properties
    print(f"URL: {response.url}")
    print(f"Status: {response.status_code}")
    print(f"Headers: {response.headers}")

    # Get a response view for parsing
    rv = self.response_view(response)

    # Access the parsed HTML document
    print(f"Title: {rv.doc.cssselect('title')[0].text_content()}")

    # Extract all links
    links = [a.get('href') for a in rv.doc.cssselect('a')]
    print(f"Found {len(links)} links")
```
Key response properties:
- `response.url` - Final URL (after redirects)
- `response.status_code` - HTTP status (200, 404, etc.)
- `response.headers` - Response headers dict
- `response.text` - Raw HTML content
- `response.json()` - Parse JSON responses (see the sketch below)
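For JSON endpoints you can skip the document tree entirely and work with `response.json()`. A minimal sketch in the same method-level style as above; the `results`, `id`, and `name` fields are illustrative, not part of qCrawl:

```python
async def parse(self, response):
    # Assumes the crawled URL returns JSON such as
    # {"results": [{"id": 1, "name": "Ada"}, ...]}
    data = response.json()
    for entry in data.get("results", []):
        yield {"id": entry.get("id"), "name": entry.get("name")}
```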
ResponseView (rv) provides:
- `rv.doc` - Lazy-loaded lxml document tree for CSS/XPath selectors
- `rv.follow(href, priority=0, meta=None)` - Create a new `Request` from a relative or absolute URL (see the sketch below)
- `rv.urljoin(url)` - Resolve a relative URL to an absolute URL (returns a string)
- `rv.response` - Access to the underlying `Response` object
- `rv.spider` - Access to the current `Spider` instance
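The two URL helpers cover most navigation needs. A short sketch of both inside `parse()`, using the signatures listed above; the priority value and meta payload are arbitrary examples:

```python
async def parse(self, response):
    rv = self.response_view(response)

    # Resolve a relative link to an absolute URL string
    absolute = rv.urljoin("/page/2/")
    self.logger.info(f"Next page resolves to {absolute}")

    # Queue a follow-up request; priority and meta are optional
    yield rv.follow("/page/2/", priority=10, meta={"page": 2})
```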
Adding configuration
Customize spider behavior with settings:
```python
from qcrawl.core.spider import Spider


class QuotesSpider(Spider):
    name = "quotes_spider"
    start_urls = ["https://quotes.toscrape.com/"]

    # Spider-specific settings
    custom_settings = {
        "CONCURRENCY": 5,
        "DELAY_PER_DOMAIN": 0.5,
        "USER_AGENT": "MyBot/1.0",
        "TIMEOUT": 30.0,
    }

    async def parse(self, response):
        rv = self.response_view(response)
        for quote in rv.doc.cssselect('div.quote'):
            text_elem = quote.cssselect('span.text')
            author_elem = quote.cssselect('small.author')

            # Skip if elements are missing
            if not text_elem or not author_elem:
                continue

            yield {
                "text": text_elem[0].text_content().strip(),
                "author": author_elem[0].text_content().strip(),
                "url": response.url,
            }
```
Common settings:
- `CONCURRENCY` - Number of concurrent requests
- `DELAY_PER_DOMAIN` - Delay between requests to the same domain (seconds)
- `USER_AGENT` - User agent string sent with requests
- `TIMEOUT` - Request timeout in seconds
- `MAX_RETRIES` - Number of retries for failed requests (see the sketch below)
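`MAX_RETRIES` doesn't appear in the snippet above; like the others, it goes in `custom_settings`. A hedged sketch with illustrative values:

```python
from qcrawl.core.spider import Spider


class QuotesSpider(Spider):
    name = "quotes_spider"
    start_urls = ["https://quotes.toscrape.com/"]

    custom_settings = {
        "CONCURRENCY": 10,        # more parallel fetches
        "DELAY_PER_DOMAIN": 1.0,  # one second between hits to the same domain
        "TIMEOUT": 15.0,          # fail requests after 15 seconds
        "MAX_RETRIES": 3,         # retry failed requests up to 3 times
    }
```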
For a full list of settings, see the settings reference.
Handling errors
Add error handling for robust spiders:
```python
from cssselect import SelectorError

from qcrawl.core.spider import Spider


class QuotesSpider(Spider):
    name = "quotes_spider"
    start_urls = ["https://quotes.toscrape.com/"]

    async def parse(self, response):
        # Check response status
        if response.status_code != 200:
            self.logger.warning(f"Non-200 status: {response.status_code}")
            return

        rv = self.response_view(response)

        # Safe selector usage
        try:
            quotes = rv.doc.cssselect('div.quote')
        except SelectorError as e:
            self.logger.error(f"Invalid selector: {e}")
            return

        if not quotes:
            self.logger.info("No quotes found on page")
            return

        for quote in quotes:
            try:
                text = quote.cssselect('span.text')[0].text_content().strip()
                author = quote.cssselect('small.author')[0].text_content().strip()
                yield {
                    "text": text,
                    "author": author,
                    "url": response.url,
                }
            except (IndexError, AttributeError) as e:
                # Skip malformed quotes
                self.logger.debug(f"Skipping quote due to: {e}")
                continue
```
Error handling patterns:
- Check `response.status_code` before processing
- Use try/except for selector operations
- Validate that elements exist before accessing them
- Log warnings/errors for debugging
- Use `continue` to skip a bad item, `return` to skip the whole page
For advanced error handling, see error recovery.
Following links and pagination
Crawl multiple pages by following links:
```python
from qcrawl.core.spider import Spider


class QuotesSpider(Spider):
    name = "quotes_spider"
    start_urls = ["https://quotes.toscrape.com/"]

    async def parse(self, response):
        rv = self.response_view(response)

        # Extract items from the current page
        for quote in rv.doc.cssselect('div.quote'):
            text_elem = quote.cssselect('span.text')
            author_elem = quote.cssselect('small.author')
            if text_elem and author_elem:
                yield {
                    "text": text_elem[0].text_content().strip(),
                    "author": author_elem[0].text_content().strip(),
                    "url": response.url,
                }

        # Follow the "next" link for pagination
        next_links = rv.doc.cssselect('li.next a')
        if next_links:
            href = next_links[0].get('href')
            if href:
                # rv.follow() resolves relative URLs and preserves context
                yield rv.follow(href)
```
Link following patterns:
- `rv.follow(href)` - Follow a relative or absolute URL
- `self.follow(response, href)` - Alternative syntax (see the sketch below)
- Yielded requests go back to the scheduler
- The same `parse()` method handles every page
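The `self.follow(response, href)` form is handy when you haven't built a `ResponseView`. A minimal sketch, assuming it mirrors `rv.follow()`'s handling of relative URLs; the href here is illustrative:

```python
from qcrawl.core.spider import Spider


class QuotesSpider(Spider):
    name = "quotes_spider"
    start_urls = ["https://quotes.toscrape.com/"]

    async def parse(self, response):
        # Follow a link without creating a ResponseView first;
        # "/page/2/" is an illustrative href
        yield self.follow(response, "/page/2/")
```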
For advanced pagination techniques, see pagination strategies.
Exporting data
Save scraped data to files:
```bash
# Export to JSON
qcrawl --export quotes.json --export-format json quotes_spider.py:QuotesSpider

# Export to CSV
qcrawl --export quotes.csv --export-format csv quotes_spider.py:QuotesSpider

# Export to XML
qcrawl --export quotes.xml --export-format xml quotes_spider.py:QuotesSpider
```
Export formats:
- `json` - JSON array of items
- `ndjson` - One JSON object per line (NDJSON; the default console output, see below)
- `csv` - CSV with headers
- `xml` - XML format
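NDJSON is convenient for streaming large crawls into line-oriented tools, since every line is a standalone JSON object. Assuming it follows the same CLI pattern as the other formats:

```bash
# Export to NDJSON (one item per line)
qcrawl --export quotes.ndjson --export-format ndjson quotes_spider.py:QuotesSpider
```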
Complete example
Putting it all together:
```python
from cssselect import SelectorError

from qcrawl.core.item import Item
from qcrawl.core.spider import Spider


class QuotesSpider(Spider):
    name = "quotes_spider"
    start_urls = ["https://quotes.toscrape.com/"]

    custom_settings = {
        "CONCURRENCY": 5,
        "DELAY_PER_DOMAIN": 0.5,
        "USER_AGENT": "QuoteBot/1.0",
        "MAX_DEPTH": 0,  # Unlimited depth
    }

    async def parse(self, response):
        """Extract quotes and follow pagination."""
        # Validate response
        if response.status_code != 200:
            self.logger.warning(f"Non-200 status: {response.status_code}")
            return

        rv = self.response_view(response)

        # Safe selector usage
        try:
            quotes = rv.doc.cssselect('div.quote')
        except SelectorError as e:
            self.logger.error(f"Selector error: {e}")
            return

        # Extract quotes
        for quote in quotes:
            try:
                text = quote.cssselect('span.text')[0].text_content().strip()
                author = quote.cssselect('small.author')[0].text_content().strip()
                tags = [t.text_content().strip()
                        for t in quote.cssselect('div.tags a.tag')]
                yield Item(
                    data={
                        "text": text,
                        "author": author,
                        "tags": tags,
                        "url": response.url,
                    },
                    metadata={"source": "quotes.toscrape.com"},
                )
            except (IndexError, AttributeError) as e:
                self.logger.debug(f"Skipping malformed quote: {e}")
                continue

        # Follow pagination
        next_links = rv.doc.cssselect('li.next a')
        if next_links:
            href = next_links[0].get('href')
            if href:
                yield rv.follow(href)
```
Run with export:
```bash
qcrawl --export quotes.json --export-format json quotes_spider.py:QuotesSpider
```
What’s next?
Ready to get started? Install qCrawl, review examples, and join the community on Discord for support.
Thanks for your interest!