This document describes the internal architecture of Reader, helping contributors understand how the system works.
```
┌─────────────────────────────────────────────────────────────────┐
│                           Public API                            │
│                       scrape() / crawl()                        │
└─────────────────────────────┬───────────────────────────────────┘
                              │
              ┌───────────────┴───────────────┐
              │                               │
        ┌─────▼─────┐                   ┌─────▼─────┐
        │  Scraper  │                   │  Crawler  │
        │   Class   │                   │   Class   │
        └─────┬─────┘                   └─────┬─────┘
              │                               │
              └───────────────┬───────────────┘
                              │
                    ┌─────────▼─────────┐
                    │    BrowserPool    │
                    │   (Hero Manager)  │
                    └─────────┬─────────┘
                              │
          ┌───────────────────┼───────────────────┐
          │                   │                   │
┌─────────▼────────┐ ┌────────▼────────┐ ┌────────▼────────┐
│   Hero Config    │ │   Cloudflare    │ │   Formatters    │
│ (TLS, DNS, etc.) │ │    Detection    │ │ (MD, HTML, etc) │
└──────────────────┘ └─────────────────┘ └─────────────────┘
```
```
src/
├── index.ts                   # Public API exports
├── scraper.ts                 # Scraper class - main scraping logic
├── crawler.ts                 # Crawler class - link discovery + scraping
├── types.ts                   # ScrapeOptions, ScrapeResult, etc.
├── crawl-types.ts             # CrawlOptions, CrawlResult, etc.
│
├── browser/
│   ├── pool.ts                # BrowserPool - manages Hero instances
│   ├── hero-config.ts         # Hero configuration (TLS, DNS, viewport)
│   └── types.ts               # IBrowserPool, PoolConfig, PoolStats
│
├── cloudflare/
│   ├── detector.ts            # detectChallenge() - DOM/text matching
│   ├── handler.ts             # waitForChallengeResolution() - polling
│   └── types.ts               # ChallengeDetection, ResolutionResult
│
├── formatters/
│   ├── markdown.ts            # formatToMarkdown() - uses supermarkdown
│   ├── html.ts                # formatToHTML() - full HTML document
│   ├── json.ts                # formatToJson() - structured JSON
│   ├── text.ts                # formatToText() - plain text
│   └── index.ts               # Re-exports all formatters
│
├── utils/
│   ├── content-cleaner.ts     # cleanContent() - removes nav, ads
│   ├── metadata-extractor.ts  # extractMetadata() - OG tags, etc.
│   ├── url-helpers.ts         # URL validation, normalization
│   ├── rate-limiter.ts        # Simple delay-based rate limiting
│   └── logger.ts              # Pino logger with pretty print
│
├── proxy/
│   └── config.ts              # createProxyUrl(), parseProxyUrl()
│
└── cli/
    └── index.ts               # CLI using Commander.js
```
The Scraper class (src/scraper.ts) handles URL scraping:
```typescript
class Scraper {
  constructor(options: ScrapeOptions) { ... }

  async scrape(): Promise<ScrapeResult> {
    // 1. Initialize browser pool
    // 2. Process URLs with concurrency control (p-limit)
    // 3. For each URL: fetch, detect challenges, extract content
    // 4. Format to requested output formats
    // 5. Aggregate results and metadata
  }

  private async scrapeSingleUrl(url: string): Promise<WebsiteScrapeResult> {
    // 1. Acquire browser from pool
    // 2. Navigate to URL
    // 3. Detect Cloudflare challenge
    // 4. Wait for resolution if needed
    // 5. Extract HTML and metadata
    // 6. Clean content
    // 7. Format to outputs
    // 8. Release browser to pool
  }
}
```

Key design decisions:

- Uses `p-limit` for concurrency control
- Each URL gets its own browser instance from the pool
- Cloudflare detection runs before content extraction
- All formatters run in parallel for each URL
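The concurrency gate provided by `p-limit` can be sketched without the dependency. The toy limiter below (the limit of 2 and the fake "scrape" tasks are illustrative, not Reader's defaults) shows the pattern:

```typescript
// Minimal concurrency limiter in the spirit of p-limit: at most `limit`
// tasks run at once; the rest wait in a FIFO queue.
function createLimiter(limit: number) {
  let active = 0;
  const queue: Array<() => void> = [];

  return async function run<T>(task: () => Promise<T>): Promise<T> {
    if (active >= limit || queue.length > 0) {
      await new Promise<void>((resolve) => queue.push(resolve));
    }
    active++;
    try {
      return await task();
    } finally {
      active--;
      queue.shift()?.(); // wake the next queued task, if any
    }
  };
}

// Illustrative use: "scrape" five URLs, never more than two at a time.
const limit = createLimiter(2);
let inFlight = 0;
let peak = 0;

const results = await Promise.all(
  ["a", "b", "c", "d", "e"].map((url) =>
    limit(async () => {
      inFlight++;
      peak = Math.max(peak, inFlight);
      await new Promise((r) => setTimeout(r, 10)); // stand-in for real work
      inFlight--;
      return url;
    })
  )
);
```

`Promise.all` preserves submission order in `results`, so callers get results aligned with their input URL list even though completion order varies.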
The Crawler class (src/crawler.ts) discovers links:
```typescript
class Crawler {
  async crawl(): Promise<CrawlResult> {
    // BFS (Breadth-First Search) algorithm
    // 1. Start with seed URL at depth 0
    // 2. Fetch page, extract links
    // 3. Filter links (same domain, patterns)
    // 4. Add to queue with depth + 1
    // 5. Repeat until maxPages or maxDepth
    // 6. Optionally scrape discovered URLs
  }
}
```

Key design decisions:

- BFS ensures shallow pages are discovered first
- Respects `maxPages` and `depth` limits
- Optional scraping reuses the Scraper class
- Delay between requests for rate limiting
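The BFS loop can be sketched with an in-memory link graph standing in for real Hero fetches (the graph, limits, and function shape here are illustrative, not the Crawler's actual code):

```typescript
// Sketch of the Crawler's BFS loop. A static link graph stands in for
// real page fetches; maxDepth/maxPages mirror the crawl limits.
type Graph = Record<string, string[]>;

function bfsCrawl(graph: Graph, seed: string, maxDepth: number, maxPages: number): string[] {
  const queue: Array<{ url: string; depth: number }> = [{ url: seed, depth: 0 }];
  const seen = new Set<string>([seed]);
  const discovered: string[] = [];

  while (queue.length > 0 && discovered.length < maxPages) {
    const { url, depth } = queue.shift()!; // FIFO queue => breadth-first order
    discovered.push(url);
    if (depth >= maxDepth) continue; // don't expand links past maxDepth

    for (const link of graph[url] ?? []) {
      if (!seen.has(link)) {
        seen.add(link); // dedupe so pages are visited once
        queue.push({ url: link, depth: depth + 1 });
      }
    }
  }
  return discovered;
}

const graph: Graph = {
  "/": ["/docs", "/blog"],
  "/docs": ["/docs/api"],
  "/blog": ["/blog/post-1"],
};

// Shallow pages come out first: /, /docs, /blog, /docs/api, /blog/post-1
const order = bfsCrawl(graph, "/", 2, 10);
```

Because the queue is FIFO, every depth-1 page is discovered before any depth-2 page, which is why `maxDepth` behaves predictably.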
The BrowserPool class (src/browser/pool.ts) manages Hero instances:
```typescript
class BrowserPool {
  private instances: HeroInstance[];
  private available: HeroInstance[];
  private queue: PendingRequest[];

  async initialize(): Promise<void> { ... }
  async acquire(): Promise<Hero> { ... }
  async release(hero: Hero): Promise<void> { ... }

  async withBrowser<T>(fn: (hero: Hero) => Promise<T>): Promise<T> {
    const hero = await this.acquire();
    try {
      return await fn(hero);
    } finally {
      await this.release(hero);
    }
  }
}
```

Pool lifecycle:

- Initialize - Create `size` Hero instances
- Acquire - Get an available instance or queue the request
- Use - Execute scraping logic
- Release - Return to pool, or recycle if stale
- Recycle - Close old instance, create a new one
- Shutdown - Close all instances

Recycling triggers:

- After N pages (default: 100)
- After N minutes (default: 30)
- On health check failure
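The first two recycling triggers reduce to a pure predicate. A sketch follows; the `InstanceStats` shape is an assumption for illustration (the pool's actual bookkeeping may differ), and health checks are omitted:

```typescript
// Sketch of the pool's recycling decision. Defaults mirror the
// triggers above: 100 pages or 30 minutes.
interface InstanceStats {
  pagesServed: number;
  createdAtMs: number;
}

function shouldRecycle(
  stats: InstanceStats,
  nowMs: number,
  maxPages = 100,
  maxAgeMs = 30 * 60 * 1000
): boolean {
  // Recycle when either threshold is crossed, whichever comes first.
  return stats.pagesServed >= maxPages || nowMs - stats.createdAtMs >= maxAgeMs;
}
```

Keeping the decision side-effect free makes it easy to call from `release()` and to test without spinning up browsers.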
Detection happens in two phases:
1. Challenge Detection (src/cloudflare/detector.ts):
```typescript
async function detectChallenge(hero: Hero): Promise<ChallengeDetection> {
  // Check DOM for challenge elements
  const signals = [];

  // CSS selectors that indicate challenges
  if (await hero.document.querySelector("#challenge-form")) {
    signals.push({ type: "dom", selector: "#challenge-form" });
  }

  // Text patterns that indicate challenges
  const bodyText = await hero.document.body.textContent;
  if (bodyText.includes("checking your browser")) {
    signals.push({ type: "text", pattern: "checking your browser" });
  }

  return {
    isChallenge: signals.length > 0,
    type: determineType(signals),
    signals,
  };
}
```

2. Challenge Resolution (src/cloudflare/handler.ts):
```typescript
async function waitForChallengeResolution(
  hero: Hero,
  options: ResolutionOptions
): Promise<ResolutionResult> {
  const startTime = Date.now();

  while (Date.now() - startTime < options.maxWaitMs) {
    // Check if URL changed (redirect after challenge)
    if ((await hero.url) !== options.initialUrl) {
      return { resolved: true, method: "redirect" };
    }

    // Check if challenge elements disappeared
    const detection = await detectChallenge(hero);
    if (!detection.isChallenge) {
      return { resolved: true, method: "element_removal" };
    }

    await sleep(options.pollIntervalMs);
  }

  return { resolved: false };
}
```

Each formatter transforms scraped pages into a specific format:
| Formatter | Input | Output |
|---|---|---|
| `formatToMarkdown` | Pages, metadata | Markdown document with frontmatter |
| `formatToHTML` | Pages, metadata | Complete HTML document with CSS |
| `formatToJson` | Pages, metadata | Structured JSON object |
| `formatToText` | Pages, metadata | Plain text extraction |
The Markdown formatter uses `supermarkdown`, a high-performance Rust-based HTML-to-Markdown converter with full GFM support.
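Running every requested format in parallel for a page might look like the sketch below; the two toy formatters stand in for the real `formatToJson`/`formatToText`, and the `Page` shape is an assumption:

```typescript
// Sketch of formatter dispatch: all requested formats run in parallel
// for a page via Promise.all.
interface Page {
  url: string;
  title: string;
  text: string;
}

type Formatter = (page: Page) => Promise<string>;

// Toy stand-ins for the real formatters.
const formatters: Record<string, Formatter> = {
  json: async (p) => JSON.stringify({ url: p.url, title: p.title }),
  text: async (p) => `${p.title}\n\n${p.text}`,
};

async function formatPage(page: Page, formats: string[]): Promise<Record<string, string>> {
  const entries = await Promise.all(
    formats.map(async (f) => [f, await formatters[f](page)] as const)
  );
  return Object.fromEntries(entries); // { json: "...", text: "..." }
}

const out = await formatPage(
  { url: "https://example.com", title: "Example", text: "Hello" },
  ["json", "text"]
);
```

A lookup table keyed by format name also makes step 1 of "adding a new formatter" a one-line registration.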
```
scrape({ urls: ["https://example.com"], formats: ["markdown"] })
  │
  ├─► Scraper.scrape()
  │     │
  │     ├─► BrowserPool.initialize(size=concurrency)
  │     │
  │     ├─► For each URL (controlled by p-limit):
  │     │     │
  │     │     ├─► pool.withBrowser(async hero => {
  │     │     │     │
  │     │     │     ├─► hero.goto(url)
  │     │     │     │
  │     │     │     ├─► detectChallenge(hero)
  │     │     │     │     └─► Returns { isChallenge, type, signals }
  │     │     │     │
  │     │     │     ├─► if (isChallenge):
  │     │     │     │     └─► waitForChallengeResolution(hero)
  │     │     │     │
  │     │     │     ├─► Extract title, HTML
  │     │     │     │
  │     │     │     ├─► cleanContent(html)
  │     │     │     │     └─► Remove nav, ads, scripts
  │     │     │     │
  │     │     │     ├─► extractMetadata(html)
  │     │     │     │     └─► OG tags, Twitter cards, etc.
  │     │     │     │
  │     │     │     └─► Format to requested formats
  │     │     │   })
  │     │     │
  │     │     └─► Add to results array
  │     │
  │     ├─► pool.shutdown()
  │     │
  │     └─► Return ScrapeResult { data[], batchMetadata }
  │
  └─► Result returned to caller
```
```
crawl({ url: "https://example.com", depth: 2, scrape: true })
  │
  ├─► Crawler.crawl()
  │     │
  │     ├─► Initialize queue with seed URL at depth 0
  │     │
  │     ├─► BFS loop (while queue not empty && pages < maxPages):
  │     │     │
  │     │     ├─► Dequeue next URL
  │     │     │
  │     │     ├─► Fetch page with Hero
  │     │     │
  │     │     ├─► Extract links via regex
  │     │     │
  │     │     ├─► Filter links:
  │     │     │     ├─► Same domain only
  │     │     │     ├─► Match includePatterns
  │     │     │     └─► Exclude excludePatterns
  │     │     │
  │     │     ├─► Add new links to queue with depth + 1
  │     │     │
  │     │     ├─► Rate limit (delay between requests)
  │     │     │
  │     │     └─► Add to discovered URLs
  │     │
  │     ├─► If scrape=true:
  │     │     └─► scrape({ urls: discoveredUrls })
  │     │
  │     └─► Return CrawlResult { urls[], scraped?, metadata }
  │
  └─► Result returned to caller
```
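The "extract links via regex" and filtering steps can be sketched as follows; the regex and helper names are illustrative, not the crawler's actual code:

```typescript
// Pull hrefs out with a regex, resolve them against the page URL,
// then keep only same-domain links that pass the exclude patterns.
function extractLinks(html: string, pageUrl: string): string[] {
  const hrefRe = /href="([^"#]+)"/g; // skip fragment-only links
  const links: string[] = [];
  for (const match of html.matchAll(hrefRe)) {
    try {
      links.push(new URL(match[1], pageUrl).toString()); // resolves relative URLs
    } catch {
      // ignore malformed URLs
    }
  }
  return links;
}

function filterLinks(links: string[], pageUrl: string, exclude: RegExp[]): string[] {
  const domain = new URL(pageUrl).hostname;
  return links.filter(
    (link) => new URL(link).hostname === domain && !exclude.some((re) => re.test(link))
  );
}

const html =
  '<a href="/docs">Docs</a> <a href="https://other.com/x">Out</a> <a href="/login">Login</a>';
const links = extractLinks(html, "https://example.com/");
const kept = filterLinks(links, "https://example.com/", [/\/login/]);
```

Resolving through the `URL` constructor normalizes relative paths for free, which keeps the dedupe set in the BFS loop accurate.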
Ulixee Hero was chosen for:
- Stealth - Advanced TLS fingerprinting and anti-detection
- Speed - Optimized for headless automation
- API - Clean async/await interface
- Stability - Production-tested at scale
We use a pool because:
- Browser startup is slow (~2-3 seconds)
- Memory overhead per browser is high
- Connection reuse improves performance
Trade-off: Stale browsers can accumulate state, so we recycle them periodically.
Multi-signal approach because:
- No single indicator is 100% reliable
- Cloudflare changes their challenge pages
- Different challenge types have different signatures
Detection signals include:
- DOM elements (`#challenge-form`, `.cf-browser-verification`)
- Text patterns ("checking your browser", "ray id")
- URL patterns (`/cdn-cgi/challenge-platform/`)
- HTTP status codes
We clean HTML before formatting because:
- Navigation, ads, scripts bloat output
- LLMs perform better with focused content
- Reduces token usage
Cleaning removes:
- `<script>` and `<style>` tags
- Navigation elements
- Footer/sidebar content
- Ad containers
- Hidden elements
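As a rough illustration of the idea (the real `cleanContent` is more thorough and this regex approach is a simplification), a cleaner stripping the noisy elements listed above might look like:

```typescript
// Naive regex-based cleaner: strips script, style, nav, and footer
// blocks from an HTML string.
function cleanHtml(html: string): string {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<style[\s\S]*?<\/style>/gi, "")
    .replace(/<nav[\s\S]*?<\/nav>/gi, "")
    .replace(/<footer[\s\S]*?<\/footer>/gi, "")
    .trim();
}

const dirty = `
<nav>Menu</nav>
<script>track()</script>
<article>Real content</article>
<footer>Links</footer>
`;
const clean = cleanHtml(dirty);
```

Regexes break down on nested or malformed markup, which is one reason a DOM-based cleaner is preferable in production.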
To add a new output format:

1. Create `src/formatters/newformat.ts`:

   ```typescript
   export function formatToNewFormat(
     pages: Page[],
     baseUrl: string,
     scrapedAt: string,
     duration: number,
     metadata?: WebsiteMetadata
   ): string {
     // Your formatting logic
   }
   ```

2. Export it from `src/formatters/index.ts`
3. Add it to the format type in `src/types.ts`:

   ```typescript
   formats?: Array<"markdown" | "html" | "json" | "text" | "newformat">
   ```

4. Call the formatter in `src/scraper.ts`
To add a new scrape option:

- Add it to `ScrapeOptions` in `src/types.ts`
- Add a default in `DEFAULT_OPTIONS`
- Use it in the `Scraper` class via `this.options.newOption`
- Add a CLI flag in `src/cli/index.ts` if needed
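The options plumbing behind these steps is the usual defaults-merge idiom: spread the defaults first so caller-supplied values win. A sketch (the field names other than `newOption` are illustrative):

```typescript
// Sketch of merging caller options over defaults.
interface ScrapeOptionsSketch {
  urls: string[];
  concurrency?: number;
  newOption?: boolean;
}

const DEFAULT_OPTIONS = {
  concurrency: 3,
  newOption: false,
};

function resolveOptions(options: ScrapeOptionsSketch) {
  // Later spreads override earlier ones, so explicit values win.
  return { ...DEFAULT_OPTIONS, ...options };
}

const resolved = resolveOptions({ urls: ["https://example.com"], newOption: true });
```

One caveat of the spread idiom: a caller passing an explicitly `undefined` field will clobber the default, so validate inputs if that matters.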
To customize Cloudflare handling:

- Detection patterns: `src/cloudflare/detector.ts`
- Resolution logic: `src/cloudflare/handler.ts`
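One way to keep detection easy to extend is to make the signal tables data, not code. This sketch is not the actual detector's internals, but it uses the signal patterns listed in this document:

```typescript
// Table-driven challenge detection: customize by extending these
// arrays rather than the matching logic.
const TEXT_PATTERNS = ["checking your browser", "ray id"];
const URL_PATTERNS = ["/cdn-cgi/challenge-platform/"];

function looksLikeChallenge(bodyText: string, requestUrls: string[]): boolean {
  const text = bodyText.toLowerCase();
  return (
    TEXT_PATTERNS.some((p) => text.includes(p)) ||
    requestUrls.some((u) => URL_PATTERNS.some((p) => u.includes(p)))
  );
}
```

New Cloudflare page variants then become one-line additions to a table, matching the multi-signal rationale above.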
Testing is currently manual. Key test scenarios:
- Basic scraping - example.com
- Cloudflare-protected sites - Sites with JS challenges
- Batch scraping - Multiple URLs with concurrency
- Crawling - Multi-page discovery
- All output formats - Verify each formatter
Further reading:

- Browser Pool - Deep dive into pool management
- Cloudflare Bypass - Understanding antibot bypass
- Production Server - Shared Hero Core pattern