Crawl ordering

The scheduler processes requests based on priority values: higher-priority requests are processed first. By adjusting priorities, you can control whether your crawler explores pages breadth-first (level by level), depth-first (following paths deeply), or with a custom focus on specific content.
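
Conceptually, the scheduler behaves like a priority queue: the pending request with the highest priority is dequeued next, and ties are broken in discovery order. The standalone sketch below illustrates only that ordering rule; PrioritySketch is illustrative, not qcrawl's actual scheduler class:

import heapq
from itertools import count

class PrioritySketch:
    def __init__(self):
        self._heap = []
        self._tie = count()  # tie-breaker preserves discovery order

    def push(self, url, priority=0):
        # heapq is a min-heap, so negate priority to pop highest first
        heapq.heappush(self._heap, (-priority, next(self._tie), url))

    def pop(self):
        _, _, url = heapq.heappop(self._heap)
        return url

q = PrioritySketch()
q.push("https://example.com/a", priority=0)
q.push("https://example.com/b", priority=100)
q.push("https://example.com/c", priority=0)
print(q.pop())  # https://example.com/b  (highest priority first)
print(q.pop())  # https://example.com/a  (FIFO among equal priorities)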

Breadth-first crawl (default)

All same-depth URLs have equal priority:

from qcrawl.core.spider import Spider
from qcrawl.core.request import Request

class BreadthFirstSpider(Spider):
    name = "breadth_first"
    start_urls = ["https://example.com"]

    async def parse(self, response):
        rv = self.response_view(response)

        # All links get same priority (processed in order discovered)
        for link in rv.doc.cssselect("a"):
            href = link.get("href")
            if href:
                yield rv.follow(href, priority=0)

Use cases:

  • Discovering all pages at each level before going deeper
  • Site mapping and structure discovery
  • When order doesn't matter

Depth-first crawl

Prioritize deeper pages by increasing priority with depth, so newly discovered deeper links jump ahead of shallower links still waiting in the queue:

async def parse(self, response):
    rv = self.response_view(response)

    current_depth = response.request.meta.get("depth", 0)
    next_priority = current_depth  # Higher depth = higher priority

    for link in rv.doc.cssselect("a"):
        href = link.get("href")
        if href:
            yield rv.follow(
                href,
                priority=next_priority,
                meta={"depth": current_depth + 1}
            )

Use cases:

  • Following specific content paths deeply
  • Getting to target pages quickly
  • Exploring hierarchical structures

Focused crawling

Prioritize specific content types or URL patterns:

async def parse(self, response):
    rv = self.response_view(response)

    # High priority for target content
    if "product" in response.url:
        for link in rv.doc.cssselect("a.product"):
            href = link.get("href")
            if href:
                yield rv.follow(href, priority=100)

    # Low priority for other pages
    else:
        for link in rv.doc.cssselect("a"):
            href = link.get("href")
            if href:
                yield rv.follow(href, priority=1)

Use cases:

  • Prioritizing valuable content
  • Targeted data extraction
  • Efficient resource usage

Combining with depth limits

Control crawl depth using settings:

class MySpider(Spider):
    name = "limited_depth"
    start_urls = ["https://example.com"]

    custom_settings = {
        "MAX_DEPTH": 3,  # Stop after 3 levels
    }

    async def parse(self, response):
        rv = self.response_view(response)

        for link in rv.doc.cssselect("a"):
            href = link.get("href")
            if href:
                yield rv.follow(href)
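
If you also want a per-branch cutoff in the callback itself, the meta-based depth counter from the depth-first example can be reused. This is a sketch, not a replacement for the setting; the hard-coded cutoff of 3 simply mirrors the MAX_DEPTH value above:

async def parse(self, response):
    rv = self.response_view(response)

    # Meta-based cutoff mirroring MAX_DEPTH = 3: stop expanding
    # this branch once it reaches three levels deep
    current_depth = response.request.meta.get("depth", 0)
    if current_depth >= 3:
        return

    for link in rv.doc.cssselect("a"):
        href = link.get("href")
        if href:
            yield rv.follow(href, meta={"depth": current_depth + 1})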

Best practices

  • Use appropriate crawl order: Choose breadth-first, depth-first, or focused based on your needs
  • Use priority sparingly: Most requests should be priority 0; reserve high priority for critical paths
  • Track depth with meta: Monitor crawl depth to prevent excessive nesting
  • Set MAX_DEPTH: Limit crawl depth to prevent runaway crawls
  • Document your strategy: Comment why certain priorities are set
  • Test incrementally: Verify crawl order matches expectations with small test runs, as shown in the sketch after this list
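
A simple way to verify ordering is to log each response as it is parsed and compare the logged sequence against the strategy you intended. The sketch below assumes the meta-based depth tracking shown earlier; the logger name is arbitrary:

import logging

logger = logging.getLogger("crawl_order")

async def parse(self, response):
    rv = self.response_view(response)

    # Record processing order so a small test run can be checked
    # against the intended strategy (breadth-first, depth-first, focused)
    depth = response.request.meta.get("depth", 0)
    logger.info("depth=%d url=%s", depth, response.url)

    for link in rv.doc.cssselect("a"):
        href = link.get("href")
        if href:
            yield rv.follow(href, meta={"depth": depth + 1})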

See also: Link Filtering, Pagination, Scheduler