Skip to content

Don't store resources content in memory - stream to storage (fixes #386)#643

Open
aivus wants to merge 8 commits into
masterfrom
issue-386-streams
Open

Don't store resources content in memory - stream to storage (fixes #386)#643
aivus wants to merge 8 commits into
masterfrom
issue-386-streams

Conversation

@aivus

@aivus aivus commented Jul 3, 2026

Copy link
Copy Markdown
Member

Fixes #386

Reworks the scraper around streams so resource content is no longer stored in memory. Memory usage is now bounded by requestConcurrency instead of total website size.

Benchmark

npm run benchmark:memory (new script) — local server, 100 files × 20 MB (2 GB total), concurrency 8:

v6.0.0 this PR
peak RSS 2274 MB 160 MB
wall time 2.6 s 1.6 s

How it works

  • lib/request.js uses got.stream and resolves at response-headers time with {url, statusCode, mimeType, encoding, metadata, stream}. A retry-event wrapper keeps got's transient-error retries working for streams (up to headers time; a connection dying mid-body can't be retried with streams).
  • lib/scraper.js: resources which need no modification (images, fonts, media, js — the bulk of bytes) are piped straight to storage at response time, while still holding the request-queue slot so requestConcurrency keeps bounding concurrent transfers. HTML/CSS still need full buffering for link rewriting (cheerio re-serializes the whole document), but their text is freed immediately after save.
  • This answers the open question in the issue about references to not-yet-downloaded pages: getReference only needs the child's filename, and filenames are derivable from response headers (mime → type → generateFilename) before the body is consumed.
  • Redirect dedup resolves to the already-requested resource outside the queue task (awaiting inside would deadlock at requestConcurrency: 1; covered by a regression test).
  • The FS plugin pipes resource.getContentStream() to disk and removes partially written files on mid-stream failure, so ignoreErrors: true never leaves partial garbage.
  • For multiple saveResource actions on a streamed resource, content is buffered once so every storage can read it (a tee with in-series actions would deadlock on backpressure). Still strictly better than v6, which always buffered everything.

Breaking changes (v7, documented in MIGRATION.md)

  • saveResource actions: resource.getText()resource.getContentStream() (single-use for streamed resources; await buffer(...) from node:stream/consumers for whole-content storages)
  • afterResponse receives {response: {url, statusCode, headers, getBody()}} instead of the full got response. String returns removed; returning an object without body keeps the resource streaming (zero buffering)
  • scrape() result no longer carries content — getText() is null after save (also in onResourceSaved)
  • generateFilename's responseData no longer contains body
  • Binary files are written byte-for-byte from the network (previously round-tripped through a latin1 string) — bug-fix-class difference
  • responseType in the request option is ignored

website-scraper-puppeteer / website-scraper-phantom will need mechanical updates for the new afterResponse contract (response.bodyawait response.getBody()).

Testing

  • 262 unit + functional tests passing (npm test), eslint clean
  • New test/functional/streaming suite: byte-fidelity of multi-MB binaries, content freed after save, both afterResponse paths, multiple saveResource storages, unconsumed streams, mid-stream failures (via a real local HTTP server — nock can't simulate a connection dying mid-body), redirect-to-already-requested-URL at requestConcurrency: 1
  • New test/unit/plugins.test.js for SaveResourceToFileSystemPlugin
  • E2E (npm run test-e2e): 20+ real sites pass; the only failures are antipenko.pp.ua, which is currently stuck in a server-side 301 self-redirect loop (fails on master too)

🤖 Generated with Claude Code

aivus and others added 7 commits July 3, 2026 10:18
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Resolves at headers time with the body as a Readable stream. New
afterResponse contract: {url, statusCode, headers, getBody()}.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Binary resources are saved immediately at response time while holding
the request queue slot; html/css are buffered for link rewriting and
their content is freed after save. Default FS plugin pipes streams to
disk.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
)

Benchmark (100 files x 20 MB, concurrency 8):
- v6.0.0: peak RSS 2274 MB
- v7.0.0: peak RSS 160 MB

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@aivus aivus left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High-effort automated review of the v7 streaming rework (diff master...issue-386-streams). Every finding below was independently re-verified against the code before posting (17/17 candidates confirmed, 0 refuted; deduplicated to 13 distinct issues).

Severity overview, most severe first:

Correctness — streaming lifecycle

  1. Save failures of streamed child resources are silently swallowed (scraper.js:228)
  2. Live got stream leaks when generateFilename throws (scraper.js:205) — reproduced live
  3. Multiple saveResource actions re-buffer the whole body as a string, throwing at ~512 MiB (scraper.js:135)
  4. getBody memoization race corrupts content under concurrent calls (request.js:66)

Correctness — undocumented v6→v7 behavior changes
5. afterResponse {body: null} now throws instead of skipping (request.js:109)
6. Default string-body encoding changed from binary to header-derived (request.js:98)
7. saveResource actions no longer run serially (scraper.js:141)

Correctness — FS plugin cleanup
8. Mid-stream failure leaves the output directory behind, breaking retries (save-resource-to-fs-plugin.js:24)
9. Cleanup rm can mask the original save error (save-resource-to-fs-plugin.js:26)

Cleanup
10. readAll duplicates stream/consumers buffer() (stream.js:1)
11. Five copy-pasted originalStream.destroy() branches (request.js:62)
12. Save failures logged as "failed to request resource" (scraper.js:238)
13. outputFile is dead code (fs.js:6)

🤖 Generated with Claude Code

Comment thread lib/scraper.js
// other types need no modification - stream them to storage immediately
resource.setContentStream(stream);
self.loadResource(resource);
await self.persistResource(resource);

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[correctness] Save failures for streamed child resources are silently swallowed

In v6, waitForLoad saved every resource via series(), so a disk-full/EACCES error while writing any image rejected scrape() when ignoreErrors: false (the default). Now non-html/css resources are persisted inside the request queue task: the error rejects requestPromise, but for child resources that promise is only awaited via Promise.allSettled in downloadChildrenResources, so the rejection is discarded. scrape() resolves successfully while the saved output is silently missing files, and the error action never fires.

Note this call also bypasses the saveResource wrapper (line 155), so there isn't even a failed to save resource warn for child resources.

Comment thread lib/scraper.js
resource.setMetadata(responseData.metadata);
}
const { stream, ...responseInfo } = responseData;
const { filename } = await self.runActions('generateFilename', { resource, responseData: responseInfo });

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[correctness] Live response stream leaks when generateFilename throws

If a generateFilename action throws (or returns null, making this destructuring throw a TypeError), the live got stream obtained at line 176 is never consumed or destroyed — cleanup only exists inside persistResource's finally and via readAll consumption, both of which are only reached later. The paused HTTP response keeps its TCP socket open indefinitely.

Reproduced live: with ignoreErrors: true and a filename plugin that throws for one URL, the process is left with open socket handles after scrape() resolves and never exits.

Same applies to any throw between request.get resolving and stream consumption (e.g. lines 202–217). A try/catch around that region that destroys responseData.stream before rethrowing would cover all of these paths.

Comment thread lib/scraper.js
if (resource.contentStream && this.actions.saveResource.length > 1) {
// live stream can be consumed only once - buffer content
// so each saveResource action can read it
resource.setText(await readAll(resource.contentStream, 'binary'));

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[correctness] Multiple saveResource actions re-buffer the whole body as a string, breaking the streaming guarantee

With more than one saveResource action registered (e.g. filesystem + S3 upload), this buffers the full body via Buffer.concat(chunks).toString('binary'):

  • For files ≥ ~512 MiB, toString throws Cannot create a string longer than 0x1fffffe8 characters, so the resource fails to save.
  • Below that limit, the v7 headline promise that memory no longer grows with resource size is silently broken whenever a second saveResource action exists.
  • The latin1 string round-trip also allocates the content twice (Buffer → string → Buffer per action) instead of buffering once as a Buffer.

Buffering as a Buffer (and exposing it via the content stream per action) would at least fix the throw and the double allocation; documenting the multi-action buffering behavior would cover the rest.

Comment thread lib/request.js
const originalStream = responseData.stream;
let bufferedBody = null;

const getBody = async () => {

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[correctness] getBody memoization is not concurrency-safe

The bufferedBody === null check and the assignment straddle an await. If an afterResponse action calls getBody() twice without awaiting the first call (e.g. Promise.all([hashBody(response), inspectBody(response)]) where each helper calls getBody), both invocations start reading the same one-shot stream and split the chunks between them — each gets a corrupted/partial body, and whichever finished last is what ends up written to disk.

Memoizing the promise instead of the resolved value fixes it:

let bufferedBodyPromise = null;
const getBody = () => {
	bufferedBodyPromise ??= readStreamToBuffer(originalStream);
	return bufferedBodyPromise;
};

Comment thread lib/request.js
bodyBuffer = Buffer.from(result.body, encoding);
} else {
originalStream.destroy();
throwAfterResponseTypeError(result.body);

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[correctness / breaking change] afterResponse returning {body: null} now throws instead of skipping the resource

In v6, getData({body: null}) returned null and transformResult treated it as "skip this resource gracefully". Now {body: null} (or {body: undefined}) falls through to this branch and throws Wrong afterResponse result. Expected null or object..., but received object — with the default ignoreErrors: false, a v6 plugin doing this on a root URL aborts the entire scrape.

Either restore the skip semantics (if (result.body === null || result.body === undefined) { originalStream.destroy(); return null; } before the type check) or document the narrowing in MIGRATION.md.

Comment thread lib/scraper.js
}

logger.info('saving resource ' + resource + ' to fs');
await this.runActions('saveResource', {resource});

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[correctness / breaking change] saveResource actions are no longer invoked serially

v6 saved all resources one at a time via series() in waitForLoad, so saveResource actions never overlapped. Now streamed resources are persisted inside their request queue tasks, concurrently and unbounded by default (requestConcurrency: Infinity). A plugin writing to a shared, non-concurrency-safe sink (appending to a zip/tar archive, a single log file, a serial DB connection) that worked in v6 will now produce corrupted output or "connection busy" errors.

Worth documenting in MIGRATION.md ("saveResource actions may run concurrently; use requestConcurrency: 1 or make your action concurrency-safe").

Comment thread lib/utils/stream.js
@@ -0,0 +1,11 @@
async function readAll (stream, encoding) {

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[cleanup] readAll re-implements stream/consumers buffer(), less robustly

lib/request.js already imports buffer from stream/consumers (as readStreamToBuffer); this hand-rolled equivalent plus its 34-line test file could be replaced with (await buffer(stream)).toString(encoding).

The manual Buffer.concat(chunks) is also less robust: it throws TypeError: list argument must be an Array of Buffers if a plugin supplies a string-mode stream (e.g. resource.setContentStream(Readable.from('text'))) and the multi-saveResource buffering path in scraper.js runs, whereas stream/consumers buffer() handles string chunks.

Comment thread lib/request.js
function transformResult (result) {
const encoding = getEncoding(result);
const data = getData(result);
async function applyAfterResponse ({ responseData, headers, afterResponse }) {

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[cleanup] originalStream.destroy() is repeated across five separate branches

Every early exit in this function (catch at line 84, null result at 89, type check at 94, non-buffer body at 108, body-replaced at 112) must remember its own destroy() call — any future early-return or throw added here (e.g. a new validation on result.encoding) silently leaks the got socket unless the author adds a sixth call.

A simpler shape: run the body in one try/catch that destroys and rethrows, and destroy in a single place at the end whenever the returned stream !== originalStream and the body was never read.

Comment thread lib/scraper.js
}
return result;
}).catch(function handleError (err) {
logger.error('failed to request resource ' + resource);

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[cleanup] Save failures of streamed resources are logged as "failed to request resource"

For non-html/css resources, persistResource is awaited inside the queue task (line 228), bypassing the saveResource wrapper's failed to save resource warn at line 159 — so a disk-full or permission error thrown by a saveResource action surfaces here as failed to request resource, pointing diagnostics at the network instead of storage.

Comment thread lib/utils/fs.js
import { createWriteStream } from 'fs';
import { pipeline } from 'stream/promises';

async function outputFile (file, data, encoding) {

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[cleanup] outputFile is now dead code

Its only caller (save-resource-to-fs-plugin.js) was switched to outputFileStream, and lib/utils/fs.js is not part of the package exports map — no caller remains anywhere in lib/, test/, or index.mjs. Keeping it means readers must reason about two write paths (buffered vs streamed) when only one is live; it can be deleted along with its export.

@aivus aivus force-pushed the issue-386-streams branch from 60d8563 to 17c58a5 Compare July 5, 2026 09:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Don't store resources content in memory

1 participant