Exporters
After a spider extracts data from a web page, it's often necessary to store that data in a structured format for further analysis or processing.
qCrawl provides exporters that serialize scraped items and write them (streamed or buffered) to a destination. They can be used via the CLI or programmatically.
Format comparison:
| Format | Streaming | Best For | File Size |
|---|---|---|---|
| NDJSON | ✅ Yes | Large datasets, streaming pipelines | Medium |
| JSON | ❌ No | API responses, small datasets | Medium |
| CSV | ✅ Yes | Excel, data analysis, flat data | Small |
| XML | ❌ No | Legacy systems, SOAP APIs | Large |
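The Streaming column follows directly from each format's on-disk layout: an NDJSON file is valid after every line, while a JSON array only becomes valid once the closing bracket is written. Here is a minimal sketch of that difference using only the standard library (illustrative, not qCrawl's exporter code):

```python
import json

items = [
    {"author": "Albert Einstein", "tags": ["change", "world"]},
    {"author": "J.K. Rowling", "tags": ["abilities", "choices"]},
]

# NDJSON: every item is a complete document on its own line,
# so each one can be written and flushed as soon as it is scraped.
with open("output.jsonl", "w", encoding="utf-8") as f:
    for item in items:
        f.write(json.dumps(item) + "\n")

# JSON: the file is only valid once the closing "]" is written,
# so the whole array has to be assembled before the end.
with open("output.json", "w", encoding="utf-8") as f:
    json.dump(items, f, indent=2)
```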
Examples - CLI usage
JSON Lines (default)
```bash
# One JSON object per line (streaming-friendly, memory efficient)
qcrawl quotes_css_spider.py --export output.jsonl

# Or explicitly specify format
qcrawl quotes_css_spider.py --export output.jsonl --export-format ndjson
```
Output (output.jsonl):
{"url": "https://quotes.toscrape.com/", "text": "The world as we...", "author": "Albert Einstein", "tags": ["change", "world"]}
{"url": "https://quotes.toscrape.com/", "text": "It is our choices...", "author": "J.K. Rowling", "tags": ["abilities", "choices"]}
JSON (full array)
```bash
# All items in a single JSON array
qcrawl quotes_css_spider.py --export output.json --export-format json
```
Output (output.json):
```json
[
  {
    "url": "https://quotes.toscrape.com/",
    "text": "The world as we...",
    "author": "Albert Einstein",
    "tags": ["change", "world"]
  },
  {
    "url": "https://quotes.toscrape.com/",
    "text": "It is our choices...",
    "author": "J.K. Rowling",
    "tags": ["abilities", "choices"]
  }
]
```
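Reading this back means parsing the whole array at once, which is fine for small result sets:

```python
import json

with open("output.json", encoding="utf-8") as f:
    items = json.load(f)  # materializes the entire array in memory

print(f"{len(items)} items")
```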
CSV
```bash
# Comma-separated values (works best with flat data structures)
qcrawl quotes_css_spider.py --export output.csv --export-format csv
```
Output (output.csv):
```csv
url,text,author,tags
https://quotes.toscrape.com/,"The world as we...",Albert Einstein,"['change', 'world']"
https://quotes.toscrape.com/,"It is our choices...",J.K. Rowling,"['abilities', 'choices']"
```
XML
```bash
# XML format
qcrawl quotes_css_spider.py --export output.xml --export-format xml
```
Output (output.xml):
```xml
<?xml version="1.0" encoding="utf-8"?>
<items>
  <item>
    <url>https://quotes.toscrape.com/</url>
    <text>The world as we...</text>
    <author>Albert Einstein</author>
    <tags>
      <value>change</value>
      <value>world</value>
    </tags>
  </item>
</items>
```
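This layout maps directly onto the standard library's ElementTree API. A minimal reader, assuming the structure shown above:

```python
import xml.etree.ElementTree as ET

root = ET.parse("output.xml").getroot()
for item in root.findall("item"):
    author = item.findtext("author")
    tags = [value.text for value in item.findall("tags/value")]
    print(author, tags)
```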
Streaming vs Buffered Mode
```bash
# Buffered (default): Collects items in memory, writes at end
qcrawl quotes_css_spider.py --export output.json --export-mode buffered

# Streaming: Writes items immediately as they're scraped (memory efficient)
qcrawl quotes_css_spider.py --export output.jsonl --export-mode stream

# Adjust buffer size (items to collect before writing)
qcrawl quotes_css_spider.py --export output.jsonl --export-buffer-size 1000
```
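Conceptually, buffered mode trades memory for fewer, larger writes: items accumulate until the buffer fills (or the crawl ends) and are then written in one batch. A rough sketch of that flush policy in NDJSON terms (illustrative only; `BufferedNdjsonWriter` is a hypothetical name, not qCrawl's internal exporter):

```python
import json


class BufferedNdjsonWriter:
    """Illustrative flush policy only -- not qCrawl's internal exporter."""

    def __init__(self, path: str, buffer_size: int = 1000) -> None:
        self.path = path
        self.buffer_size = buffer_size
        self.buffer: list[dict] = []

    def write(self, item: dict) -> None:
        self.buffer.append(item)
        if len(self.buffer) >= self.buffer_size:  # flush once the buffer fills
            self.flush()

    def flush(self) -> None:
        # Append the batch as NDJSON lines, then reset the buffer.
        with open(self.path, "a", encoding="utf-8") as f:
            for item in self.buffer:
                f.write(json.dumps(item) + "\n")
        self.buffer.clear()
```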
Examples - Programmatic usage
```python
# Async usage (recommended)
import asyncio

from qcrawl.runner import SpiderRunner
from quotes_spider import QuotesSpider  # replace with your spider

runner = SpiderRunner(
    settings={
        "export": "exports/quotes.ndjson",
        "export_format": "ndjson",
        "export_mode": "buffered",
        "export_buffer_size": 500,
        "log_level": "INFO",
        "concurrency": 50,
    }
)


async def main() -> None:
    # Await the async entrypoint from your event loop
    await runner.crawl(QuotesSpider)


if __name__ == "__main__":
    asyncio.run(main())
```
```python
# Synchronous convenience (for simple scripts)
from qcrawl.runner import SpiderRunner
from quotes_spider import QuotesSpider  # replace with your spider

runner = SpiderRunner(
    settings={
        "export": "exports/quotes.ndjson",
        "export_format": "ndjson",
        "log_level": "INFO",
        "concurrency": 50,
    }
)

# Blocks and is implemented via asyncio.run(); raises if an event loop is already running.
runner.crawl_sync(QuotesSpider)
```
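Prefer `crawl_sync` only in plain scripts and REPL sessions. Inside an async application, or in an environment that already runs an event loop (for example a Jupyter notebook), use the `await runner.crawl(...)` form shown above, since `crawl_sync` raises when a loop is already running.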