feat: add PlasmateLoader as lightweight scraping backend (no Chrome needed) by dbhurley · Pull Request #1062 · ScrapeGraphAI/Scrapegraph-ai

dbhurley · 2026-04-08T10:58:24Z

What this adds

Plasmate is an open-source Rust browser engine that outputs a Structured Object Model (SOM) instead of raw HTML. It needs no Playwright and no Chrome process, and ships as a single binary installable via pip install plasmate.

This PR adds PlasmateLoader as an alternative fetcher for static / server-rendered pages, with Chrome as an optional fallback for SPAs.

Changes

scrapegraphai/docloaders/plasmate.py — PlasmateLoader implementing BaseLoader
scrapegraphai/docloaders/__init__.py — exports PlasmateLoader
scrapegraphai/nodes/fetch_node.py — supports plasmate config dict in FetchNode (alongside browser_base and scrape_do)
tests/test_plasmate.py — 25 unit tests

Usage

from scrapegraphai.docloaders import PlasmateLoader

loader = PlasmateLoader(
    urls=['https://docs.python.org/3/library/json.html'],
    output_format='text',   # 'text' | 'som' | 'markdown' | 'links'
    timeout=30,
    fallback_to_chrome=True,  # retry with Chromium for JS-heavy SPAs
)
docs = loader.load()

Or via SmartScraperGraph config:

graph_config = {
    'llm': {...},
    'plasmate': {
        'output_format': 'text',
        'timeout': 30,
        'fallback_to_chrome': False,
    }
}

Why it's better for static pages

	ChromiumLoader	PlasmateLoader
RAM per session	~300MB	~64MB
Tokens per page (avg)	~75,000 (raw HTML)	~4,200 (SOM)
Chrome/Playwright required	Yes	No
Install	`pip install playwright && playwright install`	`pip install plasmate`
JS rendering	Yes	No (use `fallback_to_chrome=True` for SPAs)

Compression ratios measured across 45 real sites: average 17.7×, peak 77.5× (TechCrunch). Fewer tokens = lower LLM costs for AI-powered extraction.
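
To make the cost claim concrete, here is a small worked example using the average token counts from the table. The price per million input tokens is a hypothetical placeholder chosen for illustration, not a figure from this PR.

```python
# Average token counts taken from the comparison table.
RAW_HTML_TOKENS = 75_000
SOM_TOKENS = 4_200

# Hypothetical price, used only to illustrate the arithmetic.
PRICE_PER_MILLION_INPUT_TOKENS_USD = 1.00


def input_cost_usd(tokens_per_page: int, pages: int) -> float:
    """Input-token cost for `pages` pages at the assumed price."""
    return tokens_per_page * pages * PRICE_PER_MILLION_INPUT_TOKENS_USD / 1_000_000


ratio = RAW_HTML_TOKENS / SOM_TOKENS
html_cost = input_cost_usd(RAW_HTML_TOKENS, pages=1_000)
som_cost = input_cost_usd(SOM_TOKENS, pages=1_000)

print(f"ratio of the table averages: {ratio:.1f}x")
print(f"1,000 pages as raw HTML: ${html_cost:.2f}")
print(f"1,000 pages as SOM: ${som_cost:.2f}")
```

The ratio of the two table averages comes out near 17.9×, which is close to, but not the same statistic as, the 17.7× average of per-site ratios quoted above.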

Notes

Plasmate is Apache 2.0, free and open source
fallback_to_chrome=True makes it safe to use as a drop-in for mixed static/SPA workloads
No breaking changes — existing ChromiumLoader usage is untouched
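
The fallback behaviour described in these notes can be sketched roughly as follows. This is a minimal illustration, not the code in plasmate.py: `fetch_with_fallback` and both stand-in fetchers are hypothetical names, and it assumes that empty content is the only trigger for the retry.

```python
from typing import Callable


def fetch_with_fallback(
    url: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    fallback_enabled: bool = True,
) -> str:
    """Return the primary fetcher's content, retrying with the fallback when it is empty."""
    content = primary(url)
    # Non-empty content, or fallback disabled: return whatever the primary produced.
    if content.strip() or not fallback_enabled:
        return content
    return fallback(url)


# Hypothetical stand-ins for a static fetcher and a browser-based fetcher.
def static_fetcher(url: str) -> str:
    # Pretend that SPA pages yield nothing without JavaScript rendering.
    return "" if "spa" in url else "static text"


def browser_fetcher(url: str) -> str:
    return "rendered text"


static_result = fetch_with_fallback("https://example.com/docs", static_fetcher, browser_fetcher)
spa_result = fetch_with_fallback("https://example.com/spa", static_fetcher, browser_fetcher)
spa_no_fallback = fetch_with_fallback(
    "https://example.com/spa", static_fetcher, browser_fetcher, fallback_enabled=False
)
```

With the fallback disabled, a JS-heavy page simply yields empty content, so mixed workloads would need the flag turned on.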

…eeded)

Closes ScrapeGraphAI#1055

Plasmate (https://github.com/plasmate-labs/plasmate) is an open-source Rust browser engine that outputs Structured Object Model (SOM) instead of raw HTML. It requires no Chrome process, uses ~64MB RAM per session vs ~300MB, and delivers 10-100x fewer tokens per page.

Changes:
- Add scrapegraphai/docloaders/plasmate.py: PlasmateLoader
  - Implements BaseLoader (lazy_load + alazy_load)
  - Calls plasmate binary via subprocess (pip install plasmate)
  - Supports output_format: 'text' (default), 'som', 'markdown', 'links'
  - Supports --selector, --header, --timeout flags
  - Optional fallback_to_chrome=True for JS-heavy SPAs
  - Async-safe: runs subprocess in executor thread pool
- Update scrapegraphai/docloaders/__init__.py: export PlasmateLoader
- Update scrapegraphai/nodes/fetch_node.py: support plasmate config dict in FetchNode (alongside browser_base and scrape_do)
- Add tests/test_plasmate.py: 25 unit tests (init, cmd building, lazy_load, alazy_load, fallback, error handling)

Usage:

    from scrapegraphai.docloaders import PlasmateLoader

    loader = PlasmateLoader(
        urls=['https://docs.python.org/3/library/json.html'],
        output_format='text',
        timeout=30,
        fallback_to_chrome=True,  # optional: retry with Chrome for SPAs
    )
    docs = loader.load()

    # Or via FetchNode config:
    graph_config = {
        'plasmate': {
            'output_format': 'text',
            'timeout': 30,
            'fallback_to_chrome': False,
        }
    }
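
The "subprocess in executor thread pool" approach from the commit message might look something like the sketch below. It is not the loader's actual code. The flag names --selector, --header and --timeout come from the commit message, but the argument order, the header value format and the helper names (`build_cmd`, `run_blocking`, `run_async`) are assumptions. The demo runs the Python interpreter rather than the plasmate binary so that it works without Plasmate installed.

```python
import asyncio
import subprocess
import sys
from typing import Dict, List, Optional


def build_cmd(
    binary: str,
    url: str,
    timeout: int,
    selector: Optional[str] = None,
    headers: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Assemble an argument list; the layout here is assumed, not taken from Plasmate's docs."""
    cmd = [binary, url, "--timeout", str(timeout)]
    if selector:
        cmd += ["--selector", selector]
    for name, value in (headers or {}).items():
        cmd += ["--header", f"{name}: {value}"]
    return cmd


def run_blocking(cmd: List[str], timeout: int) -> str:
    """Run the command synchronously and return stdout, or '' on a non-zero exit."""
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return result.stdout if result.returncode == 0 else ""


async def run_async(cmd: List[str], timeout: int) -> str:
    """Run the blocking call in the default thread pool so the event loop stays free."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, run_blocking, cmd, timeout)


# Demo with the Python interpreter standing in for the real binary.
demo_output = asyncio.run(run_async([sys.executable, "-c", "print('hi')"], 10))
```

Passing the command as a list rather than a shell string avoids quoting problems with URLs and selectors.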

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9dd1fb54ed


chatgpt-codex-connector · 2026-04-08T11:01:56Z

+                loader = PlasmateLoader(
+                    [source],
+                    output_format=plasmate_cfg.get("output_format", "text"),
+                    timeout=plasmate_cfg.get("timeout", self.timeout or 30),
+                    selector=plasmate_cfg.get("selector"),


Forward Chromium options into Plasmate fallback

In FetchNode.handle_web_source, the PlasmateLoader instance is built without forwarding Chromium settings (headless, storage_state, loader_kwargs) as chrome_kwargs. When plasmate.fallback_to_chrome=True and Plasmate returns empty content (common on JS-heavy/authenticated pages), _fallback_fetch() creates ChromiumLoader with defaults, so fallback silently ignores configured auth/proxy/timeout behavior and can fetch incorrect content or hang longer than expected. Pass the existing Chromium options through this constructor to keep fallback behavior consistent with the normal Chromium path.
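
One way to follow this suggestion is to assemble the Chromium settings once and pass them along as chrome_kwargs. The helper below is a hypothetical sketch, not code from fetch_node.py: `build_plasmate_loader_kwargs` is an invented name, and the default of False for fallback_to_chrome is an assumption.

```python
from typing import Any, Dict, Optional


def build_plasmate_loader_kwargs(
    plasmate_cfg: Dict[str, Any],
    node_timeout: Optional[int],
    headless: bool,
    storage_state: Optional[str],
    loader_kwargs: Dict[str, Any],
) -> Dict[str, Any]:
    """Collect PlasmateLoader keyword arguments, including the Chromium fallback options."""
    # Start from the node-level loader_kwargs, then layer the explicit settings on top.
    chrome_kwargs = dict(loader_kwargs)
    chrome_kwargs["headless"] = headless
    if storage_state is not None:
        chrome_kwargs["storage_state"] = storage_state

    return {
        "output_format": plasmate_cfg.get("output_format", "text"),
        "timeout": plasmate_cfg.get("timeout", node_timeout or 30),
        "selector": plasmate_cfg.get("selector"),
        # Assumed default; the PR may choose differently.
        "fallback_to_chrome": plasmate_cfg.get("fallback_to_chrome", False),
        "chrome_kwargs": chrome_kwargs,
    }


example_kwargs = build_plasmate_loader_kwargs(
    plasmate_cfg={"fallback_to_chrome": True},
    node_timeout=None,
    headless=True,
    storage_state="auth.json",
    loader_kwargs={"proxy": {"server": "http://proxy.example:8080"}},
)
```

The resulting dict could then be unpacked into the constructor, for example PlasmateLoader([source], **example_kwargs), so the fallback sees the same auth and proxy settings as the normal Chromium path.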


chatgpt-codex-connector · 2026-04-08T11:01:56Z

+    with patch(
+        "scrapegraphai.docloaders.plasmate.ChromiumLoader",
+        return_value=mock_chrome_loader,


Patch ChromiumLoader at the correct import site in tests

This test patches scrapegraphai.docloaders.plasmate.ChromiumLoader, but ChromiumLoader is imported inside _fallback_fetch and is not a module-level attribute in plasmate.py. As written, patch(...) raises AttributeError before the test logic runs (same pattern appears again in test_no_fallback_when_content_present), so the new fallback tests fail instead of validating behavior. Patch the symbol where it is actually resolved (or expose it at module scope).
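
The failure mode and one possible fix can be reproduced in isolation. The sketch below uses two throwaway modules, `fake_chromium` and `fake_plasmate`, as hypothetical stand-ins for the real ones. It shows that patching a name that is only imported inside a function raises AttributeError, while patching the module where the class is defined takes effect.

```python
import sys
import types
from unittest.mock import MagicMock, patch

# Stand-in for the module that defines ChromiumLoader.
chromium_mod = types.ModuleType("fake_chromium")


class ChromiumLoader:
    def __init__(self, urls):
        self.urls = urls

    def load(self):
        return ["real"]


chromium_mod.ChromiumLoader = ChromiumLoader
sys.modules["fake_chromium"] = chromium_mod

# Stand-in for plasmate.py: ChromiumLoader is imported only inside the function.
plasmate_mod = types.ModuleType("fake_plasmate")


def _fallback_fetch(url):
    from fake_chromium import ChromiumLoader  # function-local import

    return ChromiumLoader([url]).load()


plasmate_mod._fallback_fetch = _fallback_fetch
sys.modules["fake_plasmate"] = plasmate_mod

# Patching the consumer module fails: it has no module-level ChromiumLoader.
try:
    with patch("fake_plasmate.ChromiumLoader"):
        pass
    patch_error = None
except AttributeError as exc:
    patch_error = exc

# Patching the defining module works, because the local import resolves there at call time.
mock_loader = MagicMock()
mock_loader.load.return_value = ["mocked"]
with patch("fake_chromium.ChromiumLoader", return_value=mock_loader):
    patched_result = plasmate_mod._fallback_fetch("https://example.com")

unpatched_result = plasmate_mod._fallback_fetch("https://example.com")
```

The other route the review mentions, importing ChromiumLoader at module scope in plasmate.py, would let the original patch target work unchanged.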


github-actions · 2026-04-09T09:41:10Z

🎉 This PR is included in version 1.76.0 🎉

The release is available on:

v1.76.0
GitHub release

Your semantic-release bot 📦🚀

dosubot bot added the size:L (this PR changes 100-499 lines, ignoring generated files) and enhancement (new feature or request) labels Apr 8, 2026

dbhurley mentioned this pull request Apr 8, 2026

Plasmate as a lightweight scraping backend - no Chrome needed #1055

Closed

chatgpt-codex-connector bot reviewed Apr 8, 2026


VinciGit00 merged commit 1238738 into ScrapeGraphAI:main Apr 9, 2026
1 check passed

github-actions bot added the released on @stable label Apr 9, 2026

VinciGit00 merged 1 commit into ScrapeGraphAI:main from dbhurley:feat/plasmate-loader


2 participants
