This directory contains comprehensive tests for the Scrapfly Crawler API, covering:
- Basic Workflow: Start, monitor, and retrieve crawl results
- Status Monitoring: Polling, caching, and status checking
- WARC Artifacts: Download, parse, iterate through records
- HAR Artifacts: Download, parse HAR format with timing info
- Content Formats: HTML, markdown, text, JSON, extracted data
- Content Retrieval: `read()`, `read_iter()`, `read_batch()`
- Configuration: Page limits, depth, path filtering
- Statistics: Crawl stats and metrics
- Error Handling: Edge cases and error scenarios
- Async Methods: Async crawler operations
- Web-scraping.dev Tests: Tests using the web-scraping.dev test site
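The start/monitor/retrieve workflow above boils down to a polling loop. A minimal sketch, using a hypothetical `get_status` callable as a stand-in for the real Crawler API status request:

```python
import time

def wait_for_crawl(get_status, poll_interval=0.0, max_polls=10):
    """Poll a status callable until the crawl reports a terminal state.

    `get_status` stands in for the real Crawler API status call; it is
    assumed to return one of "running", "completed", or "failed".
    """
    for _ in range(max_polls):
        status = get_status()
        if status in ("completed", "failed"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError("crawl did not finish within the polling budget")

# Simulate a crawl that completes on the third poll.
responses = iter(["running", "running", "completed"])
print(wait_for_crawl(lambda: next(responses)))  # completed
```

The real tests layer caching and error handling on top of this loop, but the terminal-state check is the same.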
Install test dependencies:

```shell
pip install pytest pytest-asyncio
```

Set environment variables (optional):

```shell
export SCRAPFLY_KEY="your-api-key"
export SCRAPFLY_API_HOST="https://api.scrapfly.home"
```

Run the tests:

```shell
# Run all crawler tests
pytest tests/test_crawler.py -v

# Run with output
pytest tests/test_crawler.py -v -s

# Run specific test class
pytest tests/test_crawler.py::TestCrawlerBasicWorkflow -v

# Run specific test
pytest tests/test_crawler.py::TestCrawlerBasicWorkflow::test_basic_crawl_workflow -v
```

Run individual test groups:

```shell
# Basic workflow tests
pytest tests/test_crawler.py::TestCrawlerBasicWorkflow -v

# WARC tests
pytest tests/test_crawler.py::TestCrawlerWARC -v

# HAR tests
pytest tests/test_crawler.py::TestCrawlerHAR -v

# Content format tests
pytest tests/test_crawler.py::TestContentFormats -v

# Async tests
pytest tests/test_crawler.py::TestAsyncCrawler -v

# Error handling tests
pytest tests/test_crawler.py::TestErrorHandling -v
```

To measure coverage:

```shell
pip install pytest-cov
pytest tests/test_crawler.py --cov=scrapfly --cov-report=html
```

The tests use the following test sites:
- https://web-scraping.dev - Primary test site designed for web scraping practice
  - `/products` - Product listing with pagination
  - `/product/{id}` - Product detail pages
  - Ideal for testing crawling, pagination, and path filtering
  - Specifically created for testing web scraping tools
- https://httpbin.dev - HTTP testing service
  - `/status/{code}` - Returns specific HTTP status codes
  - Used for testing error handling (404, 503, etc.)
  - Homepage has docs about available endpoints
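HAR artifacts are plain JSON (`log.entries[]`, each with a `timings` breakdown), so the parsing side of the HAR tests needs nothing beyond the standard library. A minimal sketch with a hand-written HAR document (real artifacts downloaded from the Crawler API are far larger):

```python
import json

# A tiny hand-crafted HAR document for illustration only.
har_text = json.dumps({
    "log": {
        "entries": [
            {"request": {"url": "https://web-scraping.dev/products"},
             "time": 120.5,
             "timings": {"wait": 80.0, "receive": 40.5}},
            {"request": {"url": "https://web-scraping.dev/product/1"},
             "time": 95.0,
             "timings": {"wait": 60.0, "receive": 35.0}},
        ]
    }
})

har = json.loads(har_text)
for entry in har["log"]["entries"]:
    # Each entry carries the request URL, total time, and per-phase timings.
    print(entry["request"]["url"], entry["time"], entry["timings"]["wait"])
```

The HAR tests assert on exactly these fields: entry count, per-request URLs, and timing values.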
The test suite covers:
- ✅ Starting and stopping crawls
- ✅ Status monitoring and polling
- ✅ WARC artifact parsing
- ✅ HAR artifact parsing with timing data
- ✅ Multiple content formats (HTML, markdown, text, JSON)
- ✅ Content retrieval methods (read, read_iter, read_batch)
- ✅ Path filtering (exclude_paths, include_only_paths)
- ✅ Page limits and depth limits
- ✅ Pattern matching for URL filtering
- ✅ Batch content retrieval (up to 100 URLs)
- ✅ Error handling and edge cases
- ✅ Async operations
- ✅ Crawler statistics
- ✅ Failed crawls (503 errors, etc.)
- ✅ Method chaining
- ✅ Caching behavior
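The path-filtering behavior checked above (`exclude_paths`, `include_only_paths`) can be approximated locally with glob matching. A sketch of the expected semantics, assuming exclusions win over inclusions; the `allowed` helper is hypothetical, not part of the SDK:

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def allowed(url, include_only_paths=None, exclude_paths=None):
    """Mimic include/exclude path filtering with glob patterns."""
    path = urlparse(url).path
    # Exclusions take priority over inclusions.
    if exclude_paths and any(fnmatch(path, p) for p in exclude_paths):
        return False
    if include_only_paths:
        return any(fnmatch(path, p) for p in include_only_paths)
    return True

urls = [
    "https://web-scraping.dev/products",
    "https://web-scraping.dev/product/1",
    "https://web-scraping.dev/login",
]
kept = [u for u in urls
        if allowed(u, include_only_paths=["/product*"], exclude_paths=["/login"])]
print(kept)  # ['https://web-scraping.dev/products', 'https://web-scraping.dev/product/1']
```

This is what the pattern-matching tests assert: product pages pass the include filter while excluded paths are dropped regardless.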
When adding new tests:
- Use the `client` fixture for `ScrapflyClient` instances
- Use the `test_url` fixture for the default test URL
- Keep tests focused and independent
- Use descriptive test names
- Add docstrings to explain what's being tested
- Clean up any resources (though crawls are stateless)
Example:

```python
def test_new_feature(self, client, test_url):
    """Test the new feature XYZ"""
    config = CrawlerConfig(url=test_url, page_limit=3)
    crawl = Crawl(client, config).crawl().wait()

    # Test assertions
    assert crawl.started
    assert crawl.uuid is not None
```

To speed up test runs:
- Reduce `page_limit` in test configs
- Use smaller test sites
- Run specific test classes instead of all tests
If tests fail to connect:
- Check that the `SCRAPFLY_KEY` environment variable is set
- Verify `SCRAPFLY_API_HOST` is correct
- Ensure the API is accessible from your network

If async tests are skipped or error out:
- Install pytest-asyncio: `pip install pytest-asyncio`
- Tests marked with `@pytest.mark.asyncio` require this plugin
To run tests in CI/CD pipelines:

```yaml
# Example GitHub Actions step
- name: Run Crawler Tests
  env:
    SCRAPFLY_KEY: ${{ secrets.SCRAPFLY_KEY }}
    SCRAPFLY_API_HOST: ${{ secrets.SCRAPFLY_API_HOST }}
  run: |
    pip install pytest pytest-asyncio
    pytest tests/test_crawler.py -v
```