feat: add PlasmateLoader as lightweight scraping backend (no Chrome needed) #1062
…eeded)

Closes ScrapeGraphAI#1055

Plasmate (https://github.com/plasmate-labs/plasmate) is an open-source Rust browser engine that outputs Structured Object Model (SOM) instead of raw HTML. It requires no Chrome process, uses ~64MB RAM per session vs ~300MB, and delivers 10-100x fewer tokens per page.

Changes:
- Add scrapegraphai/docloaders/plasmate.py: PlasmateLoader
  - Implements BaseLoader (lazy_load + alazy_load)
  - Calls plasmate binary via subprocess (pip install plasmate)
  - Supports output_format: 'text' (default), 'som', 'markdown', 'links'
  - Supports --selector, --header, --timeout flags
  - Optional fallback_to_chrome=True for JS-heavy SPAs
  - Async-safe: runs subprocess in executor thread pool
- Update scrapegraphai/docloaders/__init__.py: export PlasmateLoader
- Update scrapegraphai/nodes/fetch_node.py: support plasmate config dict in FetchNode (alongside browser_base and scrape_do)
- Add tests/test_plasmate.py: 25 unit tests (init, cmd building, lazy_load, alazy_load, fallback, error handling)

Usage:

    from scrapegraphai.docloaders import PlasmateLoader

    loader = PlasmateLoader(
        urls=['https://docs.python.org/3/library/json.html'],
        output_format='text',
        timeout=30,
        fallback_to_chrome=True,  # optional: retry with Chrome for SPAs
    )
    docs = loader.load()

    # Or via FetchNode config:
    graph_config = {
        'plasmate': {
            'output_format': 'text',
            'timeout': 30,
            'fallback_to_chrome': False,
        }
    }
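The "calls plasmate binary via subprocess" and "async-safe" points can be sketched roughly as follows. This is not the code from plasmate.py: the `fetch` subcommand, the `--format` flag, the timeout unit, and the helper names `build_plasmate_cmd`, `run_cmd`, and `fetch_async` are assumptions made for illustration. Only `--selector`, `--header`, and `--timeout` are named in the commit message.

```python
import asyncio
import subprocess
from typing import Callable, Dict, List, Optional


def build_plasmate_cmd(
    url: str,
    output_format: str = "text",
    timeout: int = 30,
    selector: Optional[str] = None,
    headers: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Build an argv list for the plasmate binary.

    ASSUMPTION: the "fetch" subcommand and "--format" flag are hypothetical;
    check the real CLI before relying on them.
    """
    cmd = ["plasmate", "fetch", url, "--format", output_format,
           "--timeout", str(timeout)]
    if selector:
        cmd += ["--selector", selector]
    for name, value in (headers or {}).items():
        cmd += ["--header", f"{name}: {value}"]
    return cmd


def run_cmd(cmd: List[str], timeout: int) -> str:
    """Blocking call: run the binary and return stdout, or '' on failure."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout + 5
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return ""
    return result.stdout.strip() if result.returncode == 0 else ""


async def fetch_async(
    cmd: List[str],
    timeout: int,
    runner: Callable[[List[str], int], str] = run_cmd,
) -> str:
    """Run the blocking subprocess call in the default thread pool executor
    so the event loop is not blocked while the binary runs."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, runner, cmd, timeout)
```

Passing the command as a list (rather than a shell string) avoids quoting problems with URLs and header values, and the injectable `runner` makes the async path testable without the binary installed.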
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9dd1fb54ed
    loader = PlasmateLoader(
        [source],
        output_format=plasmate_cfg.get("output_format", "text"),
        timeout=plasmate_cfg.get("timeout", self.timeout or 30),
        selector=plasmate_cfg.get("selector"),
Forward Chromium options into Plasmate fallback
In FetchNode.handle_web_source, the PlasmateLoader instance is built without forwarding Chromium settings (headless, storage_state, loader_kwargs) as chrome_kwargs. When plasmate.fallback_to_chrome=True and Plasmate returns empty content (common on JS-heavy/authenticated pages), _fallback_fetch() creates ChromiumLoader with defaults, so fallback silently ignores configured auth/proxy/timeout behavior and can fetch incorrect content or hang longer than expected. Pass the existing Chromium options through this constructor to keep fallback behavior consistent with the normal Chromium path.
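A minimal sketch of the suggested fix, using stand-in classes so it runs without scrapegraphai installed. `FakeChromiumLoader`, `SketchPlasmateLoader`, and `build_loader` are hypothetical names, and the way the options are merged into `chrome_kwargs` is an assumption; the real constructor signatures may differ.

```python
from typing import Any, Callable, Dict, List, Optional


class FakeChromiumLoader:
    """Stand-in for ChromiumLoader: records the kwargs it was built with."""

    def __init__(self, urls: List[str], **kwargs: Any) -> None:
        self.urls = urls
        self.kwargs = kwargs

    def load(self) -> List[str]:
        return [f"chrome:{u}" for u in self.urls]


class SketchPlasmateLoader:
    """Stand-in for PlasmateLoader with an empty-content Chrome fallback."""

    def __init__(
        self,
        urls: List[str],
        fallback_to_chrome: bool = False,
        chrome_kwargs: Optional[Dict[str, Any]] = None,
        fetch: Callable[[str], str] = lambda url: "",  # simulates empty output
    ) -> None:
        self.urls = urls
        self.fallback_to_chrome = fallback_to_chrome
        self.chrome_kwargs = chrome_kwargs or {}
        self.fetch = fetch
        self.last_fallback: Optional[FakeChromiumLoader] = None

    def _fallback_fetch(self, url: str) -> str:
        # The forwarded options reach the fallback loader here.
        self.last_fallback = FakeChromiumLoader([url], **self.chrome_kwargs)
        return self.last_fallback.load()[0]

    def load(self) -> List[str]:
        docs = []
        for url in self.urls:
            content = self.fetch(url)
            if not content.strip() and self.fallback_to_chrome:
                content = self._fallback_fetch(url)
            docs.append(content)
        return docs


def build_loader(
    source: str,
    plasmate_cfg: Dict[str, Any],
    headless: bool,
    storage_state: Optional[str],
    loader_kwargs: Dict[str, Any],
) -> SketchPlasmateLoader:
    """Forward the node's Chromium settings instead of dropping them."""
    chrome_kwargs = {
        "headless": headless,
        "storage_state": storage_state,
        **loader_kwargs,
    }
    return SketchPlasmateLoader(
        [source],
        fallback_to_chrome=plasmate_cfg.get("fallback_to_chrome", False),
        chrome_kwargs=chrome_kwargs,
    )
```

With this shape, an authenticated page that comes back empty from Plasmate is retried with the same `storage_state` and proxy settings the normal Chromium path would have used.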
    with patch(
        "scrapegraphai.docloaders.plasmate.ChromiumLoader",
        return_value=mock_chrome_loader,
Patch ChromiumLoader at the correct import site in tests
This test patches scrapegraphai.docloaders.plasmate.ChromiumLoader, but ChromiumLoader is imported inside _fallback_fetch and is not a module-level attribute in plasmate.py. As written, patch(...) raises AttributeError before the test logic runs (same pattern appears again in test_no_fallback_when_content_present), so the new fallback tests fail instead of validating behavior. Patch the symbol where it is actually resolved (or expose it at module scope).
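The failure mode can be reproduced in isolation. The sketch below registers two throwaway modules, `demo_chromium` and `demo_plasmate` (hypothetical names standing in for the real modules), and shows that `patch` raises `AttributeError` when the target module has no such attribute, while patching the module the function-local import reads from works.

```python
import sys
import types
from unittest.mock import MagicMock, patch


class _RealChromiumLoader:
    """Stand-in for the real loader class."""

    def __init__(self, urls):
        self.urls = urls

    def load(self):
        return ["real"]


# Module that defines the class (stand-in for the Chromium loader module).
chromium_mod = types.ModuleType("demo_chromium")
chromium_mod.ChromiumLoader = _RealChromiumLoader
sys.modules["demo_chromium"] = chromium_mod


def _fallback_fetch(url):
    # Function-local import: ChromiumLoader never becomes an attribute
    # of demo_plasmate, so it cannot be patched there.
    from demo_chromium import ChromiumLoader

    return ChromiumLoader([url]).load()


# Module that only uses the class inside a function.
plasmate_mod = types.ModuleType("demo_plasmate")
plasmate_mod._fallback_fetch = _fallback_fetch
sys.modules["demo_plasmate"] = plasmate_mod


def patch_at_wrong_site() -> bool:
    """Return True if patching the using module fails with AttributeError."""
    try:
        with patch("demo_plasmate.ChromiumLoader"):
            pass
    except AttributeError:
        return True
    return False


def patch_at_source():
    """Patch the defining module; the local import then picks up the mock."""
    mock_loader = MagicMock()
    mock_loader.load.return_value = ["mocked"]
    with patch("demo_chromium.ChromiumLoader", return_value=mock_loader) as mock_cls:
        result = plasmate_mod._fallback_fetch("https://example.com")
    return result, mock_cls.call_args
```

The other option named in the review, importing `ChromiumLoader` at module scope in plasmate.py, would make the original patch target valid, at the cost of importing the Chromium loader eagerly.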
🎉 This PR is included in version 1.76.0 🎉
Your semantic-release bot 📦🚀
Closes #1055
What this adds
Plasmate is an open-source Rust browser engine that outputs Structured Object Model (SOM) instead of raw HTML. No Playwright, no Chrome process, single binary installable via pip install plasmate.

This PR adds PlasmateLoader as an alternative fetcher for static / server-rendered pages, with Chrome as optional fallback for SPAs.

Changes
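- scrapegraphai/docloaders/plasmate.py: PlasmateLoader implementing BaseLoader
- scrapegraphai/docloaders/__init__.py: exports PlasmateLoader
- scrapegraphai/nodes/fetch_node.py: supports plasmate config dict in FetchNode (alongside browser_base and scrape_do)
- tests/test_plasmate.py: 25 unit tests

Usage

    from scrapegraphai.docloaders import PlasmateLoader

    loader = PlasmateLoader(
        urls=['https://docs.python.org/3/library/json.html'],
        output_format='text',
        timeout=30,
        fallback_to_chrome=True,  # optional: retry with Chrome for SPAs
    )
    docs = loader.load()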
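Or via SmartScraperGraph config:

    graph_config = {
        'plasmate': {
            'output_format': 'text',
            'timeout': 30,
            'fallback_to_chrome': False,
        }
    }

Why it's better for static pages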
pip install playwright && playwright install | pip install plasmate
… (fallback_to_chrome=True for SPAs)

Compression ratios measured across 45 real sites: average 17.7×, peak 77.5× (TechCrunch). Fewer tokens = lower LLM costs for AI-powered extraction.
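As a rough illustration of how a compression ratio translates into token counts and cost: the ratio 17.7 is the average quoted above, while the raw token count and the price per million tokens passed in are hypothetical inputs, not measurements from this PR.

```python
def compressed_tokens(raw_tokens: float, ratio: float) -> float:
    """Tokens left after compression, given a raw count and a ratio > 0."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return raw_tokens / ratio


def cost_usd(tokens: float, usd_per_million: float) -> float:
    """Cost of sending `tokens` input tokens at a per-million-token price."""
    return tokens / 1_000_000 * usd_per_million


def savings_usd(raw_tokens: float, ratio: float, usd_per_million: float) -> float:
    """Difference in cost between raw and compressed input."""
    before = cost_usd(raw_tokens, usd_per_million)
    after = cost_usd(compressed_tokens(raw_tokens, ratio), usd_per_million)
    return before - after


# Hypothetical inputs: a 50,000-token raw page at 1.00 USD per million tokens.
example_saving = savings_usd(50_000, 17.7, 1.00)
```

The saving scales linearly with page size and price, so the per-page benefit depends on how large the raw HTML is and which model is used.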
Notes
- fallback_to_chrome=True makes it safe to use as a drop-in for mixed static/SPA workloads
- ChromiumLoader usage is untouched