Crawl ordering
The scheduler processes requests in order of priority: higher-priority requests are dispatched first. By adjusting priorities, you can make your crawler explore pages breadth-first (level by level), depth-first (following paths deeply), or with a custom focus on specific content.
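Conceptually, the scheduler behaves like a priority queue: whatever currently has the highest priority is fetched next. The standalone snippet below is plain Python, not qcrawl code, and only illustrates that ordering; the real scheduler's tie-breaking between equal priorities may differ.

import heapq

# Simulate priority ordering: higher priority is dispatched first,
# ties fall back to insertion order.
queue = []
for order, (priority, url) in enumerate([
    (0, "https://example.com/a"),
    (10, "https://example.com/important"),
    (0, "https://example.com/b"),
]):
    heapq.heappush(queue, (-priority, order, url))

while queue:
    _, _, url = heapq.heappop(queue)
    print(url)
# -> /important first, then /a, then /b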
Breadth-first crawl (default)
Every discovered link gets the same priority, so pages are processed in the order they are found, level by level:
from qcrawl.core.spider import Spider
from qcrawl.core.request import Request


class BreadthFirstSpider(Spider):
    name = "breadth_first"
    start_urls = ["https://example.com"]

    async def parse(self, response):
        rv = self.response_view(response)
        # All links get the same priority (processed in order discovered)
        for link in rv.doc.cssselect("a"):
            href = link.get("href")
            if href:
                yield rv.follow(href, priority=0)
Use cases:
- Discovering all pages at each level before going deeper
- Site mapping and structure discovery
- When order doesn't matter
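If you want the level-by-level order to be explicit rather than relying on equal priorities and discovery order, you can decrease priority with depth so shallower pages always outrank deeper ones. This is a sketch that reuses the meta-based depth counter from the depth-first example below and assumes the scheduler accepts negative priority values.

async def parse(self, response):
    rv = self.response_view(response)
    current_depth = response.request.meta.get("depth", 0)

    for link in rv.doc.cssselect("a"):
        href = link.get("href")
        if href:
            # Shallower pages get higher priority, so each level
            # drains before the next one starts (assumes negative
            # priorities are accepted by the scheduler)
            yield rv.follow(
                href,
                priority=-(current_depth + 1),
                meta={"depth": current_depth + 1},
            )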
Depth-first crawl
Prioritize deeper pages by increasing priority with depth:
async def parse(self, response):
    rv = self.response_view(response)
    current_depth = response.request.meta.get("depth", 0)
    next_priority = current_depth  # Higher depth = higher priority

    for link in rv.doc.cssselect("a"):
        href = link.get("href")
        if href:
            yield rv.follow(
                href,
                priority=next_priority,
                meta={"depth": current_depth + 1},
            )
Use cases:
- Following specific content paths deeply
- Getting to target pages quickly
- Exploring hierarchical structures
Focused crawling
Prioritize specific content types or URL patterns:
async def parse(self, response):
    rv = self.response_view(response)

    # High priority for target content
    if "product" in response.url:
        for link in rv.doc.cssselect("a.product"):
            href = link.get("href")
            if href:
                yield rv.follow(href, priority=100)
    # Low priority for other pages
    else:
        for link in rv.doc.cssselect("a"):
            href = link.get("href")
            if href:
                yield rv.follow(href, priority=1)
Use cases:
- Prioritizing valuable content
- Targeted data extraction
- Efficient resource usage
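As the number of target content types grows, it can help to keep the intent in one place with a mapping from URL patterns to priorities. The sketch below uses the same `rv.follow` API; the `PRIORITY_RULES` table, its patterns, and the priority values are illustrative, not part of the library.

import re

# Illustrative pattern -> priority table; tune for your own site
PRIORITY_RULES = [
    (re.compile(r"/product/"), 100),
    (re.compile(r"/category/"), 10),
]


async def parse(self, response):
    rv = self.response_view(response)
    for link in rv.doc.cssselect("a"):
        href = link.get("href")
        if not href:
            continue
        priority = 1  # default for everything else
        for pattern, value in PRIORITY_RULES:
            if pattern.search(href):
                priority = value
                break
        yield rv.follow(href, priority=priority)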
Combining with depth limits
Control crawl depth using settings:
class MySpider(Spider):
    name = "limited_depth"
    start_urls = ["https://example.com"]
    custom_settings = {
        "MAX_DEPTH": 3,  # Stop after 3 levels
    }

    async def parse(self, response):
        rv = self.response_view(response)
        for link in rv.doc.cssselect("a"):
            href = link.get("href")
            if href:
                yield rv.follow(href)
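If you also want the cutoff visible in the spider itself, the same effect can be enforced manually with the meta-based depth counter from the depth-first example. A minimal sketch, with the depth limit of 3 as an arbitrary example value:

async def parse(self, response):
    rv = self.response_view(response)
    current_depth = response.request.meta.get("depth", 0)

    # Stop following links once the chosen depth is reached
    if current_depth >= 3:
        return

    for link in rv.doc.cssselect("a"):
        href = link.get("href")
        if href:
            yield rv.follow(href, meta={"depth": current_depth + 1})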
Best practices
- Use appropriate crawl order: Choose breadth-first, depth-first, or focused based on your needs
- Use priority sparingly: Most requests should be priority 0; reserve high priority for critical paths
- Track depth with meta: Monitor crawl depth to prevent excessive nesting
- Set MAX_DEPTH: Limit crawl depth to prevent runaway crawls
- Document your strategy: Comment why certain priorities are set
- Test incrementally: Verify crawl order matches expectations with small test runs (see the sketch after this list)
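One way to run such a small test is a throwaway spider that caps depth and follows only a few links per page, printing each fetched URL so the actual processing order is easy to eyeball. This is a sketch; the slice of 3 links, the MAX_DEPTH of 2, and the use of print for output are arbitrary test choices.

from qcrawl.core.spider import Spider


class OrderCheckSpider(Spider):
    name = "order_check"
    start_urls = ["https://example.com"]
    custom_settings = {
        "MAX_DEPTH": 2,  # keep the test run short
    }

    async def parse(self, response):
        # Print each fetched URL so the actual processing order is visible
        print(response.url)
        rv = self.response_view(response)
        for link in rv.doc.cssselect("a")[:3]:  # only a few links per page
            href = link.get("href")
            if href:
                yield rv.follow(href, priority=0)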
See also: Link Filtering, Pagination, Scheduler