feat: add PlasmateLoader as lightweight scraping backend (no Chrome needed) #1062

Merged: VinciGit00 merged 1 commit into ScrapeGraphAI:main from dbhurley:feat/plasmate-loader on Apr 9, 2026

@dbhurley (Contributor) commented Apr 8, 2026

Closes #1055

What this adds

Plasmate is an open-source Rust browser engine that outputs a Structured Object Model (SOM) instead of raw HTML. It needs no Playwright and no Chrome process, and ships as a single binary installable via pip install plasmate.

This PR adds PlasmateLoader as an alternative fetcher for static / server-rendered pages, with Chrome as an optional fallback for SPAs.

Changes

  • scrapegraphai/docloaders/plasmate.py: PlasmateLoader implementing BaseLoader
  • scrapegraphai/docloaders/__init__.py: exports PlasmateLoader
  • scrapegraphai/nodes/fetch_node.py: supports plasmate config dict in FetchNode (alongside browser_base and scrape_do)
  • tests/test_plasmate.py: 25 unit tests

Usage

from scrapegraphai.docloaders import PlasmateLoader

loader = PlasmateLoader(
    urls=['https://docs.python.org/3/library/json.html'],
    output_format='text',   # 'text' | 'som' | 'markdown' | 'links'
    timeout=30,
    fallback_to_chrome=True,  # retry with Chromium for JS-heavy SPAs
)
docs = loader.load()

Or via SmartScraperGraph config:

graph_config = {
    'llm': {...},
    'plasmate': {
        'output_format': 'text',
        'timeout': 30,
        'fallback_to_chrome': False,
    }
}
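
As a rough illustration of how a fetch step might pick a backend from this config, the sketch below checks for a plasmate dict next to browser_base and scrape_do and fills in the defaults quoted in the review excerpt further down (output_format 'text', timeout 30). The names select_backend and resolve_plasmate_options, the priority order, the 'chromium' label, and the False default for fallback_to_chrome are assumptions made for illustration; this is not the actual FetchNode code.

```python
from typing import Any, Dict, Optional


def select_backend(node_config: Dict[str, Any]) -> str:
    """Return the name of the first backend key present in the config.

    ASSUMPTION: the priority order below is illustrative only.
    """
    for key in ("browser_base", "scrape_do", "plasmate"):
        if node_config.get(key) is not None:
            return key
    return "chromium"  # hypothetical label for the default Chromium path


def resolve_plasmate_options(
    plasmate_cfg: Dict[str, Any], node_timeout: Optional[int] = None
) -> Dict[str, Any]:
    """Fill in defaults for the options shown in the PR description."""
    return {
        "output_format": plasmate_cfg.get("output_format", "text"),
        "timeout": plasmate_cfg.get("timeout", node_timeout or 30),
        "selector": plasmate_cfg.get("selector"),
        # ASSUMPTION: fallback is opt-in when the key is missing.
        "fallback_to_chrome": plasmate_cfg.get("fallback_to_chrome", False),
    }


graph_config = {"plasmate": {"output_format": "markdown", "fallback_to_chrome": True}}
backend = select_backend(graph_config)
options = resolve_plasmate_options(graph_config["plasmate"])
print(backend, options)
```

Keeping the default resolution in one small function makes it easy to unit test without touching the network or a browser.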

Why it's better for static pages

| | ChromiumLoader | PlasmateLoader |
|---|---|---|
| RAM per session | ~300MB | ~64MB |
| Tokens per page (avg) | ~75,000 (raw HTML) | ~4,200 (SOM) |
| Chrome/Playwright required | Yes | No |
| Install | pip install playwright && playwright install | pip install plasmate |
| JS rendering | Yes | No (use fallback_to_chrome=True for SPAs) |

Compression ratios measured across 45 real sites: average 17.7×, peak 77.5× (TechCrunch). Fewer tokens = lower LLM costs for AI-powered extraction.
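
To make the arithmetic behind these figures concrete, the snippet below recomputes the ratios from the table's averages. The token ratio it yields (about 17.9×) is close to the 17.7× average reported across the 45 sites. The price per million tokens and the page count are made-up values used only to show how the saving scales; they are not from the PR.

```python
# Figures quoted in the comparison table above.
chromium_tokens = 75_000   # raw HTML, average per page
plasmate_tokens = 4_200    # SOM, average per page
chromium_ram_mb = 300
plasmate_ram_mb = 64

token_ratio = chromium_tokens / plasmate_tokens
ram_ratio = chromium_ram_mb / plasmate_ram_mb

# ASSUMPTION: a hypothetical input price and workload, for illustration only.
price_per_million_tokens = 1.00  # hypothetical USD per 1M input tokens
pages = 1_000
saving = (chromium_tokens - plasmate_tokens) * pages / 1_000_000 * price_per_million_tokens

print(f"token ratio: {token_ratio:.1f}x")
print(f"RAM ratio:   {ram_ratio:.1f}x")
print(f"saving over {pages} pages: ${saving:.2f}")
```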

Notes

  • Plasmate is Apache 2.0, free and open source
  • fallback_to_chrome=True makes it safe to use as a drop-in for mixed static/SPA workloads
  • No breaking changes — existing ChromiumLoader usage is untouched
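
The fallback behaviour described in the notes can be sketched roughly as follows, assuming (as the review comment further down states) that the fallback triggers when Plasmate returns empty content. The function fetch_with_fallback and the two fake fetchers are hypothetical stand-ins, not code from the PR.

```python
from typing import Callable


def fetch_with_fallback(
    url: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    fallback_enabled: bool = False,
) -> str:
    """Try the lightweight fetcher first; use the fallback only on empty output."""
    content = primary(url)
    if content.strip():
        return content
    if fallback_enabled:
        return fallback(url)
    return content


# Hypothetical stand-ins: a static page yields text, an SPA shell yields nothing.
def fake_plasmate(url: str) -> str:
    return "" if "spa" in url else f"static text from {url}"


def fake_chrome(url: str) -> str:
    return f"rendered text from {url}"


static_result = fetch_with_fallback("https://example.com/docs", fake_plasmate, fake_chrome, True)
spa_result = fetch_with_fallback("https://example.com/spa", fake_plasmate, fake_chrome, True)
spa_no_fallback = fetch_with_fallback("https://example.com/spa", fake_plasmate, fake_chrome, False)
print(static_result, spa_result, repr(spa_no_fallback), sep="\n")
```

With the fallback disabled, an SPA simply yields empty content, which is why the option matters for mixed static/SPA workloads.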

…eeded)

Closes ScrapeGraphAI#1055

Plasmate (https://github.com/plasmate-labs/plasmate) is an open-source
Rust browser engine that outputs Structured Object Model (SOM) instead
of raw HTML. It requires no Chrome process, uses ~64MB RAM per session
vs ~300MB, and delivers 10-100x fewer tokens per page.

Changes:
- Add scrapegraphai/docloaders/plasmate.py: PlasmateLoader
  - Implements BaseLoader (lazy_load + alazy_load)
  - Calls plasmate binary via subprocess (pip install plasmate)
  - Supports output_format: 'text' (default), 'som', 'markdown', 'links'
  - Supports --selector, --header, --timeout flags
  - Optional fallback_to_chrome=True for JS-heavy SPAs
  - Async-safe: runs subprocess in executor thread pool
- Update scrapegraphai/docloaders/__init__.py: export PlasmateLoader
- Update scrapegraphai/nodes/fetch_node.py: support plasmate config dict
  in FetchNode (alongside browser_base and scrape_do)
- Add tests/test_plasmate.py: 25 unit tests (init, cmd building,
  lazy_load, alazy_load, fallback, error handling)

Usage:
  from scrapegraphai.docloaders import PlasmateLoader

  loader = PlasmateLoader(
      urls=['https://docs.python.org/3/library/json.html'],
      output_format='text',
      timeout=30,
      fallback_to_chrome=True,  # optional: retry with Chrome for SPAs
  )
  docs = loader.load()

  # Or via FetchNode config:
  graph_config = {
      'plasmate': {
          'output_format': 'text',
          'timeout': 30,
          'fallback_to_chrome': False,
      }
  }
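
The commit message above mentions building a command for the plasmate binary and running it through a thread pool so that async callers are not blocked. A minimal sketch of that pattern follows. Only the flag names --selector, --header and --timeout come from the commit message; the "fetch" subcommand, the argument order, the "Name: value" header format, and the helper names are assumptions, and the demo runs a stand-in Python command rather than Plasmate itself.

```python
import asyncio
import subprocess
import sys
from typing import Dict, List, Optional


def build_cmd(
    url: str,
    timeout: int = 30,
    selector: Optional[str] = None,
    headers: Optional[Dict[str, str]] = None,
    binary: str = "plasmate",
) -> List[str]:
    # ASSUMPTION: the "fetch" subcommand and argument layout are hypothetical;
    # only the --selector, --header and --timeout flag names are from the commit.
    cmd = [binary, "fetch", url, "--timeout", str(timeout)]
    if selector:
        cmd += ["--selector", selector]
    for name, value in (headers or {}).items():
        cmd += ["--header", f"{name}: {value}"]
    return cmd


def run_blocking(cmd: List[str], timeout: int) -> str:
    """Run the command synchronously and return its stdout."""
    done = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout, check=False)
    return done.stdout


async def run_async(cmd: List[str], timeout: int) -> str:
    """Offload the blocking call to the default thread pool executor."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, run_blocking, cmd, timeout)


example_cmd = build_cmd("https://example.com", selector="main", headers={"Accept": "text/html"})

# Stand-in command so the sketch runs without Plasmate installed.
stand_in = [sys.executable, "-c", "print('hello from subprocess')"]
output = asyncio.run(run_async(stand_in, timeout=30))
print(example_cmd)
print(output.strip())
```

Passing the command as a list (rather than a shell string) avoids quoting problems with URLs and selectors.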

@dosubot (bot) added the size:L and enhancement labels on Apr 8, 2026

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9dd1fb54ed

Comment on lines +362 to +366
loader = PlasmateLoader(
    [source],
    output_format=plasmate_cfg.get("output_format", "text"),
    timeout=plasmate_cfg.get("timeout", self.timeout or 30),
    selector=plasmate_cfg.get("selector"),

P1: Forward Chromium options into Plasmate fallback

In FetchNode.handle_web_source, the PlasmateLoader instance is built without forwarding Chromium settings (headless, storage_state, loader_kwargs) as chrome_kwargs. When plasmate.fallback_to_chrome=True and Plasmate returns empty content (common on JS-heavy/authenticated pages), _fallback_fetch() creates ChromiumLoader with defaults, so fallback silently ignores configured auth/proxy/timeout behavior and can fetch incorrect content or hang longer than expected. Pass the existing Chromium options through this constructor to keep fallback behavior consistent with the normal Chromium path.
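
One way to picture the suggested fix is the sketch below, which gathers the Chromium settings named in this comment into a chrome_kwargs dict and hands them to the fallback loader. Both loader classes are hypothetical stubs that only record what they receive, and collect_chrome_kwargs, the merge order, and the sample values are assumptions; the real constructor signatures may differ.

```python
from typing import Any, Dict, List, Optional


class StubChromiumLoader:
    """Hypothetical stand-in that just records the options it receives."""

    def __init__(self, urls: List[str], **kwargs: Any) -> None:
        self.urls = urls
        self.kwargs = kwargs


class StubPlasmateLoader:
    """Hypothetical stand-in showing only the option-forwarding path."""

    def __init__(
        self,
        urls: List[str],
        fallback_to_chrome: bool = False,
        chrome_kwargs: Optional[Dict[str, Any]] = None,
    ) -> None:
        self.urls = urls
        self.fallback_to_chrome = fallback_to_chrome
        self.chrome_kwargs = dict(chrome_kwargs or {})

    def _fallback_fetch(self, url: str) -> StubChromiumLoader:
        # Reuse the caller's Chromium settings instead of library defaults.
        return StubChromiumLoader([url], **self.chrome_kwargs)


def collect_chrome_kwargs(node_config: Dict[str, Any]) -> Dict[str, Any]:
    """Gather the settings named in the review comment into one dict."""
    kwargs = dict(node_config.get("loader_kwargs", {}))
    kwargs["headless"] = node_config.get("headless", True)
    if node_config.get("storage_state") is not None:
        kwargs["storage_state"] = node_config["storage_state"]
    return kwargs


node_config = {
    "headless": False,
    "storage_state": "auth.json",  # hypothetical path
    "loader_kwargs": {"proxy": {"server": "http://proxy.example:8080"}},
}
loader = StubPlasmateLoader(
    ["https://example.com/app"],
    fallback_to_chrome=True,
    chrome_kwargs=collect_chrome_kwargs(node_config),
)
fallback = loader._fallback_fetch("https://example.com/app")
print(fallback.kwargs)
```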

Comment thread tests/test_plasmate.py
Comment on lines +195 to +197
with patch(
    "scrapegraphai.docloaders.plasmate.ChromiumLoader",
    return_value=mock_chrome_loader,

P2: Patch ChromiumLoader at the correct import site in tests

This test patches scrapegraphai.docloaders.plasmate.ChromiumLoader, but ChromiumLoader is imported inside _fallback_fetch and is not a module-level attribute in plasmate.py. As written, patch(...) raises AttributeError before the test logic runs (same pattern appears again in test_no_fallback_when_content_present), so the new fallback tests fail instead of validating behavior. Patch the symbol where it is actually resolved (or expose it at module scope).
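
The failure mode described here can be reproduced in isolation. The sketch below builds two throwaway modules (demo_chromium and demo_plasmate, both hypothetical) in which the class is imported inside a function body, then shows that patching the consumer module raises AttributeError while patching the module that owns the class takes effect.

```python
import sys
import types
from unittest.mock import MagicMock, patch

# Hypothetical module that owns the class (stands in for the Chromium loader module).
chromium_mod = types.ModuleType("demo_chromium")


class ChromiumLoader:
    def __init__(self, urls):
        self.urls = urls

    def load(self):
        return ["real browser content"]


chromium_mod.ChromiumLoader = ChromiumLoader
sys.modules["demo_chromium"] = chromium_mod

# Hypothetical module that imports the class lazily, inside the function body.
plasmate_mod = types.ModuleType("demo_plasmate")


def fallback_fetch(url):
    from demo_chromium import ChromiumLoader as Loader  # resolved at call time

    return Loader([url]).load()


plasmate_mod.fallback_fetch = fallback_fetch
sys.modules["demo_plasmate"] = plasmate_mod

# 1) Patching the consumer module fails: it has no such module-level attribute.
try:
    with patch("demo_plasmate.ChromiumLoader"):
        pass
    wrong_site_error = None
except AttributeError as exc:
    wrong_site_error = exc

# 2) Patching the module where the name is looked up at call time works.
mock_loader = MagicMock()
mock_loader.load.return_value = ["mocked content"]
with patch("demo_chromium.ChromiumLoader", return_value=mock_loader) as mock_cls:
    patched_result = fallback_fetch("https://example.com")

print(type(wrong_site_error).__name__, patched_result)
```

The alternative mentioned in the comment, importing the class at module scope in the consumer, would make the original patch target valid, at the cost of importing the browser loader even when the fallback is never used.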

@VinciGit00 merged commit 1238738 into ScrapeGraphAI:main on Apr 9, 2026
1 check passed
github-actions bot commented Apr 9, 2026

🎉 This PR is included in version 1.76.0 🎉

Labels: enhancement, released on @stable, size:L

Linked issue: Plasmate as a lightweight scraping backend - no Chrome needed