Selectors
When scraping web pages, the primary task is extracting data from the documents. qCrawl uses lxml as the default HTML parser because of its speed and native XPath support.
Using selectors
The examples below assume a response view object rv created from a response; rv.doc holds the parsed lxml tree:
async def parse(self, response):
    rv = self.response_view(response)
    tree = rv.doc  # lxml.html.HtmlElement
Basic Examples
# Element with id="main"
tree.cssselect('#main')
# Elements with class="highlight" (matches even when the element has other classes too)
tree.cssselect('.highlight')
# <div> with class="post"
tree.cssselect('div.post')
# <a> tags with href attribute containing "login"
tree.cssselect('a[href*="login"]')
# Direct child: <ul> > <li>
tree.cssselect('ul > li')
# Any descendant: <ul> <li>
tree.cssselect('ul li')
Common Patterns
# All text inside .article-body
body_text = tree.cssselect('.article-body')[0].text_content()
# All image srcs
images = [img.get('src') for img in tree.cssselect('img') if img.get('src')]
# Product prices (common patterns)
prices = tree.cssselect('.price, .product-price, [itemprop="price"]')
# OpenGraph / meta tags
meta_title = tree.cssselect('meta[property="og:title"]')[0].get('content')
Advanced Examples
Selecting by Multiple Classes (AND logic)
example.html
<div class="post featured important">...</div>
# Matches elements that have ALL three classes
tree.cssselect('.post.featured.important')
# No space between dots = AND
Case-Insensitive Attribute Matching (CSS4)
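# The i flag comes from Selectors Level 4, not from HTML5. cssselect (the library lxml uses
# for CSS) may not parse it; if it raises SelectorSyntaxError, compare with XPath translate() instead.
tree.cssselect('[rel="NoFollow" i]') # intended to match "nofollow", "NoFollow", etc.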
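If the installed cssselect version rejects the i flag (worth checking against your own environment), a case-insensitive comparison can be written directly in XPath. A minimal sketch with invented markup; note that translate() as used here only folds ASCII letters.

```python
from lxml import html

tree = html.fromstring(
    "<div>"
    '<a rel="NoFollow" href="/a">A</a>'
    '<a rel="nofollow" href="/b">B</a>'
    '<a rel="author" href="/c">C</a>'
    "</div>"
)

UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"

# XPath 1.0 has no lower-case() function, so translate() maps A-Z to a-z.
# The alphabets are passed as XPath variables to keep the expression readable.
nofollow = tree.xpath(
    ".//a[translate(@rel, $upper, $lower) = 'nofollow']",
    upper=UPPER,
    lower=LOWER,
)
hrefs = [a.get("href") for a in nofollow]
print(hrefs)
```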
Using :has() for Complex Relationships
# All articles that contain an image
articles_with_images = tree.cssselect('article:has(img)')
# All comments that have a reply form as a direct child
tree.cssselect('.comment:has(> form)')
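:has() is a Selectors Level 4 feature, so whether it parses depends on the installed cssselect version. Where it is unavailable, the same relationships can be expressed with XPath predicates. The sketch below uses invented markup and ids purely for illustration.

```python
from lxml import html

tree = html.fromstring(
    "<div>"
    '<article id="a1"><p>text</p><img src="/x.png"></article>'
    '<article id="a2"><p>no image</p></article>'
    '<div class="comment" id="c1"><form></form></div>'
    '<div class="comment" id="c2"><div class="inner"><form></form></div></div>'
    "</div>"
)

# Rough equivalent of article:has(img): any <img> descendant qualifies.
with_images = tree.xpath(".//article[.//img]")

# Rough equivalent of .comment:has(> form): the <form> must be a direct child.
with_form = tree.xpath(
    ".//*[contains(concat(' ', normalize-space(@class), ' '), ' comment ')][form]"
)

image_ids = [el.get("id") for el in with_images]
form_ids = [el.get("id") for el in with_form]
print(image_ids, form_ids)
```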
Using :is() to Reduce Repetition
# Instead of repeating header selectors
headers = tree.cssselect('h1.title, h2.title, h3.title, h4.title')
# Cleaner with :is()
headers = tree.cssselect(':is(h1,h2,h3,h4).title')
Form Inputs by Type
tree.cssselect('input[type="text"]')
tree.cssselect('input[type="checkbox"]:checked') # Note: :checked matches the checked attribute in the markup, not live browser state
Performance Tips
- Be as specific as possible from the left
    # Fast
    tree.cssselect('div#content p.title')
    # Slower (scans entire document)
    tree.cssselect('p.title')
- Use direct child > when possible
    example.html
    <article class="product">
      <div class="title">Awesome Gadget</div>
      <div class="price">$99.99</div>
      <div class="reviews">
        <span class="rating">4.8</span>
        <div class="title">Customer Reviews</div> <!-- nested title! -->
      </div>
    </article>
    <article class="product"> ... another product ... </article>
    # This will accidentally match the "Customer Reviews" title too!
    titles = tree.cssselect('.product .title')
    # Only matches <div class="title"> that is a DIRECT child of .product
    titles = tree.cssselect('.product > .title')
- For repeated selections, compile the selector once:
    from cssselect import GenericTranslator

    from qcrawl.core.item import Item
    from qcrawl.core.spider import Spider

    # compile once (module/class level) for reuse
    _XPATH_LINKS = GenericTranslator().css_to_xpath("article.post > a.title")


    class ExampleSpider(Spider):
        name = "example"
        start_urls = ["https://example.com/"]

        async def parse(self, response):
            rv = self.response_view(response)
            tree = rv.doc  # lxml.html.HtmlElement
            for a in tree.xpath(_XPATH_LINKS):
                href = a.get("href")
                if not href:
                    continue
                text = a.text_content().strip()
                yield Item(data={"url": rv.urljoin(href), "title": text})
Common Pitfalls & Gotchas
| Issue | Solution |
|---|---|
| Namespaces (e.g., XHTML, SVG) | Choose HTML vs XML mode deliberately: use html.fromstring() for HTML (lenient parsing), use etree.fromstring() (or etree.XMLParser) for XML with namespace-aware XPath. Pass namespaces mapping to .xpath() when needed. |
| Dynamic content (JavaScript-loaded) | lxml does not execute JavaScript. Render with a browser automation tool (Selenium, Playwright), then pass the rendered HTML to lxml for parsing. |
| Anti-scraping (obfuscated class names) | Prefer structure- or attribute-based selectors (tags, ARIA roles, data-* attributes, text), use heuristics (position, parent/child relationships), or render and interact via a headless browser. Rotate user-agents / IPs and respect robots.txt and site terms. |
Notes
- For mixed XML/HTML sources (RSS, Atom, SVG embedded in HTML), parse with the appropriate parser and use namespaces in XPath queries.
- When using browser rendering, capture the final page HTML and call html.fromstring(rendered_html) to continue using lxml selectors.
- Keep selectors resilient: avoid relying solely on ephemeral class names; prefer semantic attributes when possible.
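The namespace advice above can be sketched with a small Atom feed. The feed content is invented for illustration; the atom prefix is an arbitrary local name bound to the real Atom namespace URI.

```python
from lxml import etree

# Hypothetical feed, passed as bytes because it carries an XML declaration.
ATOM = b"""<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example feed</title>
  <entry><title>First post</title><link href="https://example.com/1"/></entry>
  <entry><title>Second post</title><link href="https://example.com/2"/></entry>
</feed>
"""

NS = {"atom": "http://www.w3.org/2005/Atom"}  # prefix name is arbitrary

feed = etree.fromstring(ATOM)

# Unprefixed names do not match elements that live in a default namespace.
unprefixed = feed.xpath("//entry")

# With the namespaces mapping, the prefixed names resolve correctly.
titles = feed.xpath("//atom:entry/atom:title/text()", namespaces=NS)
links = feed.xpath("//atom:entry/atom:link/@href", namespaces=NS)
print(unprefixed, titles, links)
```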
Supported CSS Selectors
| Selector | Example | Description |
|---|---|---|
| Type selector | div | All <div> elements |
| Universal | * | All elements |
| Class | .warning | Any element whose class list includes "warning" |
| ID | #header | Element with id="header" |
| Attribute (exact) | [href="https://example.com"] | Exact match |
| Attribute (whitespace-separated) | [class~="special"] | Class contains the word "special" |
| Attribute (starts with) | [href^="https://"] | href begins with https:// |
| Attribute (ends with) | [href$=".pdf"] | href ends with .pdf |
| Attribute (contains) | [href*="example"] | href contains "example" |
| Child combinator | div > p | Direct children only |
| Descendant combinator | div p | Any <p> inside <div> |
| Adjacent sibling | h2 + p | <p> immediately after <h2> |
| General sibling | h2 ~ p | Any <p> sibling after <h2> (not necessarily adjacent) |
| :first-child | li:first-child | <li> that is the first child of its parent |
| :last-child | li:last-child | <li> that is the last child of its parent |
| :nth-child(n) | tr:nth-child(odd) | Odd rows (also accepts even or a number) |
| :nth-child(an+b) | li:nth-child(3n+1) | 1st, 4th, 7th... |
| :nth-last-child() | li:nth-last-child(2) | Second-to-last <li> |
| :only-child | p:only-child | <p> that is the only child |
| :empty | td:empty | Elements with no children/text |
| :not() | a:not([href]) | <a> without href |
| :has() (CSS4) | div:has(> img) | <div> that directly contains an <img> |
| :is() / :where() | section :is(h1, h2, h3) | Any h1/h2/h3 inside section |