Testing
qCrawl follows a practical testing philosophy focused on quality over quantity:
- Test behavior, not implementation - Avoid testing private methods or internal details
- Mock at boundaries - Mock external dependencies only, not internal qCrawl code
- Follow pytest best practices - Use fixtures, parametrization, clear organization, AAA pattern
- High-value integration tests - Cover all critical paths with integration tests
Test Types
Unit Tests (Fast, Isolated)
Located in: tests/
When to write:
- Testing argument parsing, validation logic
- Testing pure functions without I/O
- Testing class initialization and configuration
Example:
def test_parse_args_basic(monkeypatch):
    """parse_args correctly parses CLI arguments."""
    monkeypatch.setattr(sys, "argv", ["qcrawl", "spider.py:Spider", "--export", "out.json"])
    args = cli.parse_args()
    assert args.spider == "spider.py:Spider"
    assert args.export == "out.json"
Characteristics:
- Fast (< 100ms each)
- No external dependencies
- Can mock internal components for isolation
- Run frequently during development
Integration Tests (Real Behavior)
Located in: tests/integration/
When to write:
- Testing end-to-end spider execution
- Testing actual HTTP crawling behavior
- Testing export/storage functionality
- Testing middleware chains
Example:
@pytest.mark.integration
@pytest.mark.asyncio
async def test_spider_crawls_real_http(httpbin_server, args_no_export):
    """Spider successfully crawls and parses real HTTP responses."""
    class TestSpider(Spider):
        name = "test"
        start_urls = [f"{httpbin_server}/json"]

        async def parse(self, response):
            data = response.json()
            yield {"title": data.get("slideshow", {}).get("title")}

    # Run against REAL HTTP server (Docker container)
    await run(TestSpider, args_no_export, spider_settings, runtime_settings)
    # Verify actual behavior (output written to stdout)
Characteristics:
- Slower (requires Docker containers)
- Tests against real services (HTTP servers, Redis)
- NO mocking of internal qCrawl components
- Run before commits/PRs
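Custom markers like integration must be registered with pytest so that -m integration and -m "not integration" select tests reliably and do not trigger unknown-marker warnings. A minimal sketch, assuming registration is done in the shared conftest.py (a markers entry in the pytest configuration file works equally well):
def pytest_configure(config):
    # Sketch: register the custom marker; the real project may already do
    # this in its pytest configuration.
    config.addinivalue_line(
        "markers", "integration: tests that require real services via Docker"
    )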
Mocking Strategy: "Mock at Boundaries"
DO Mock (External Dependencies)
Mock things you don't control:
# Mock HTTP client (external dependency)
with patch("qcrawl.core.downloader.aiohttp.ClientSession"):
    ...

# Mock Redis (external service)
with patch("redis.asyncio.Redis"):
    ...

# Mock file system for unit tests
with patch("pathlib.Path.write_text"):
    ...

# Mock sys.argv for CLI tests
monkeypatch.setattr(sys, "argv", ["qcrawl", "spider:Spider"])
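For async boundaries, a plain MagicMock is not awaitable; unittest.mock.AsyncMock fills that gap. A hypothetical sketch of giving the patched session a canned response (the patch target mirrors the example above; the shape of fake_response is an assumption for illustration, not qCrawl's real Response):
from unittest.mock import AsyncMock, MagicMock, patch

# Hypothetical sketch: the mocked session returns a canned payload, so the
# code under test never performs network I/O.
fake_response = MagicMock()
fake_response.status = 200
fake_response.json = AsyncMock(return_value={"ok": True})

with patch("qcrawl.core.downloader.aiohttp.ClientSession") as session_cls:
    session_cls.return_value.get = AsyncMock(return_value=fake_response)
    ...  # exercise the code under test here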
DON'T Mock (Internal Components)
Don't mock things you control and want to test:
# BAD - Don't mock internal qCrawl components
with patch("qcrawl.core.spider.Spider"):  # NO
    ...

with patch("qcrawl.core.crawler.Crawler"):  # NO
    ...

with patch("qcrawl.core.engine.CrawlEngine"):  # NO
    ...

# GOOD - Let internal components run naturally
spider = MySpider()
await crawler.crawl()  # Real execution
Why? Mocking internal components makes tests brittle and lets integration bugs slip through undetected. Integration tests should exercise the real code paths.
Using Docker with Testcontainers
For integration tests, use real services via Docker instead of mocking.
Setup
Install qCrawl with dev dependencies (includes testcontainers, pytest, and other test tools):
pip install -e ".[dev]" # For local development
# OR
pip install qcrawl[dev] # From PyPI
This installs all development dependencies including:
- testcontainers - Docker containers for testing
- pytest, pytest-asyncio, pytest-cov - Testing framework
- ruff, mypy - Linting and type checking
- qcrawl[redis], qcrawl[camoufox] - Optional qCrawl features
HTTP Server Fixture
import pytest
from testcontainers.core.container import DockerContainer


@pytest.fixture(scope="module")
def httpbin_server():
    """Start httpbin container for testing."""
    container = DockerContainer("kennethreitz/httpbin:latest")
    container.with_exposed_ports(80)
    container.start()
    host = container.get_container_host_ip()
    port = container.get_exposed_port(80)
    yield f"http://{host}:{port}"
    container.stop()
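The same pattern works for other external services such as Redis. A sketch using testcontainers' built-in RedisContainer (the fixture name redis_server and the returned URL format are assumptions; adapt them to whatever the Redis queue tests expect):
import pytest
from testcontainers.redis import RedisContainer


@pytest.fixture(scope="module")
def redis_server():
    """Start a Redis container for queue integration tests (sketch)."""
    container = RedisContainer("redis:7-alpine")
    container.start()
    host = container.get_container_host_ip()
    port = container.get_exposed_port(6379)
    yield f"redis://{host}:{port}"
    container.stop()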
Using in Tests
@pytest.mark.integration
@pytest.mark.asyncio
async def test_spider_against_real_http(httpbin_server):
    """Test spider against real HTTP server - NO MOCKING."""
    class JsonSpider(Spider):
        name = "json"
        start_urls = [f"{httpbin_server}/json"]

        async def parse(self, response):
            yield response.json()

    # Runs against REAL HTTP server in Docker
    await run(JsonSpider, args, spider_settings, runtime_settings)
Benefits:
- Tests actual HTTP behavior (redirects, timeouts, headers)
- Catches real integration bugs
- More confidence than mocked tests
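Because these tests fail hard when Docker is absent, it can be worth skipping them explicitly in environments without it. A hedged sketch, using the presence of the docker CLI on PATH as a stand-in for a usable daemon (that proxy is an assumption):
import shutil

import pytest

# Applied at module level in tests/integration/ modules: skip everything
# when the docker CLI cannot be found on PATH.
pytestmark = pytest.mark.skipif(
    shutil.which("docker") is None,
    reason="Docker is required for integration tests",
)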
Best Practices
1. Use Pytest Fixtures
# GOOD - Use fixtures
@pytest.fixture
def sample_spider():
    return MySpider()


def test_spider_init(sample_spider):
    assert sample_spider.name == "my_spider"


# BAD - Avoid class-based setup
class TestSpider:  # Don't do this
    def setup_method(self):
        self.spider = MySpider()
2. Parametrize for Multiple Scenarios
@pytest.mark.parametrize("export_value,expected_output", [
    (None, "stdout"),
    ("-", "stdout"),
    ("file.json", "file"),
])
def test_export_variations(export_value, expected_output):
    args = argparse.Namespace(export=export_value)
    result = determine_output(args)
    assert result == expected_output
3. Follow AAA Pattern
def test_spider_extracts_data():
    # Arrange
    spider = MySpider()
    response = create_test_response()

    # Act
    items = list(spider.parse(response))

    # Assert
    assert len(items) == 2
    assert items[0]["title"] == "Test"
4. Use Clear Assertions with Messages
# GOOD - Clear assertion messages
assert len(items) > 0, "Should extract at least one item"
assert "title" in item, "Item should have title field"
# BAD - No context on failure
assert len(items) > 0
assert "title" in item
5. Test Public API Only
# GOOD - Test public methods
def test_spider_parse():
    spider = MySpider()
    items = spider.parse(response)
    assert items


# BAD - Don't test private methods
def test_spider_internal_helper():
    spider = MySpider()
    result = spider._internal_helper()  # Private method
Running Tests
Run All Tests
pytest
Run Unit Tests Only (Fast)
pytest -m "not integration"
Run Integration Tests Only
pytest -m integration
# OR
pytest tests/integration/
Run Specific Test
pytest tests/test_cli.py::test_parse_args_basic -v
Run with Coverage
pytest --cov=qcrawl --cov-report=term-missing
Skip Slow Tests
pytest -m "not integration" --maxfail=1 # Fast feedback
Test Organization
tests/
├── core/
│   ├── queues/
│   │   ├── test_factory.py
│   │   └── test_memory_queue.py
│   ├── conftest.py          # Core-specific fixtures
│   ├── test_spider.py
│   ├── test_engine.py
│   └── ...
├── downloaders/
│   ├── test_downloader.py
│   ├── test_camoufox.py
│   └── ...
├── middleware/
│   ├── downloader/
│   │   ├── test_retry.py
│   │   ├── test_cookies.py
│   │   └── ...
│   ├── spider/
│   │   ├── test_depth.py
│   │   ├── test_offsite.py
│   │   └── ...
│   ├── conftest.py          # Middleware fixtures
│   └── test_manager.py
├── pipelines/
│   ├── test_manager.py
│   ├── test_validation.py
│   └── ...
├── runner/
│   ├── test_export.py
│   ├── test_run.py
│   └── ...
├── utils/
│   ├── test_url.py
│   ├── test_fingerprint.py
│   └── ...
├── integration/             # Require Docker
│   ├── test_runner.py
│   ├── test_camoufox.py
│   └── test_redis_queue.py
├── conftest.py              # Shared fixtures
├── test_cli.py
└── ...
Section Comments
Organize tests with clear section comments:
# Initialization Tests

def test_spider_init_valid():
    """Spider initializes with valid parameters."""
    ...


def test_spider_init_invalid():
    """Spider raises error with invalid parameters."""
    ...


# Parsing Tests

def test_parse_html():
    """Spider parses HTML correctly."""
    ...


# Error Handling Tests

def test_parse_handles_missing_elements():
    """Spider gracefully handles missing HTML elements."""
    ...