Don't store resources content in memory - stream to storage (fixes #386) by aivus · Pull Request #643 · website-scraper/node-website-scraper

aivus · 2026-07-03T11:09:30Z

Fixes #386

Reworks the scraper around streams so resource content is no longer stored in memory. Memory usage is now bounded by requestConcurrency instead of total website size.

Benchmark

npm run benchmark:memory (new script) — local server, 100 files × 20 MB (2 GB total), concurrency 8:

	v6.0.0	this PR
peak RSS	2274 MB	160 MB
wall time	2.6 s	1.6 s

How it works

lib/request.js uses got.stream and resolves at response-headers time with {url, statusCode, mimeType, encoding, metadata, stream}. A retry-event wrapper keeps got's transient-error retries working for streams (up to headers time; a connection dying mid-body can't be retried with streams).
lib/scraper.js: resources which need no modification (images, fonts, media, js — the bulk of bytes) are piped straight to storage at response time, while still holding the request-queue slot so requestConcurrency keeps bounding concurrent transfers. HTML/CSS still need full buffering for link rewriting (cheerio re-serializes the whole document), but their text is freed immediately after save.
This answers the open question in the issue about references to not-yet-downloaded pages: getReference only needs the child's filename, and filenames are derivable from response headers (mime → type → generateFilename) before the body is consumed.
Redirect dedup resolves to the already-requested resource outside the queue task (awaiting inside would deadlock at requestConcurrency: 1; covered by a regression test).
The FS plugin pipes resource.getContentStream() to disk and removes partially written files on mid-stream failure, so ignoreErrors: true never leaves partial garbage.
For multiple saveResource actions on a streamed resource, content is buffered once so every storage can read it (a tee with in-series actions would deadlock on backpressure). Still strictly better than v6, which always buffered everything.

Breaking changes (v7, documented in MIGRATION.md)

saveResource actions: resource.getText() → resource.getContentStream() (single-use for streamed resources; await buffer(...) from node:stream/consumers for whole-content storages)
afterResponse receives {response: {url, statusCode, headers, getBody()}} instead of the full got response. String returns removed; returning an object without body keeps the resource streaming (zero buffering)
scrape() result no longer carries content — getText() is null after save (also in onResourceSaved)
generateFilename's responseData no longer contains body
Binary files are written byte-for-byte from the network (previously round-tripped through a latin1 string) — bug-fix-class difference
responseType in the request option is ignored

website-scraper-puppeteer / website-scraper-phantom will need mechanical updates for the new afterResponse contract (response.body → await response.getBody()).

Testing

262 unit + functional tests passing (npm test), eslint clean
New test/functional/streaming suite: byte-fidelity of multi-MB binaries, content freed after save, both afterResponse paths, multiple saveResource storages, unconsumed streams, mid-stream failures (via a real local HTTP server — nock can't simulate a connection dying mid-body), redirect-to-already-requested-URL at requestConcurrency: 1
New test/unit/plugins.test.js for SaveResourceToFileSystemPlugin
E2E (npm run test-e2e): 20+ real sites pass; the only failures are antipenko.pp.ua, which is currently stuck in a server-side 301 self-redirect loop (fails on master too)

🤖 Generated with Claude Code

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Resolves at headers time with the body as a Readable stream. New afterResponse contract: {url, statusCode, headers, getBody()}. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Binary resources are saved immediately at response time while holding the request queue slot; html/css are buffered for link rewriting and their content is freed after save. Default FS plugin pipes streams to disk. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

) Benchmark (100 files x 20 MB, concurrency 8): - v6.0.0: peak RSS 2274 MB - v7.0.0: peak RSS 160 MB Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

# Conflicts: # README.md

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

aivus

High-effort automated review of the v7 streaming rework (diff master...issue-386-streams). Every finding below was independently re-verified against the code before posting (17/17 candidates confirmed, 0 refuted; deduplicated to 13 distinct issues).

Severity overview, most severe first:

Correctness — streaming lifecycle

Save failures of streamed child resources are silently swallowed (scraper.js:228)
Live got stream leaks when generateFilename throws (scraper.js:205) — reproduced live
Multiple saveResource actions re-buffer the whole body as a string, throwing at ~512 MiB (scraper.js:135)
getBody memoization race corrupts content under concurrent calls (request.js:66)

Correctness — undocumented v6→v7 behavior changes
5. afterResponse {body: null} now throws instead of skipping (request.js:109)
6. Default string-body encoding changed from binary to header-derived (request.js:98)
7. saveResource actions no longer run serially (scraper.js:141)

Correctness — FS plugin cleanup
8. Mid-stream failure leaves the output directory behind, breaking retries (save-resource-to-fs-plugin.js:24)
9. Cleanup rm can mask the original save error (save-resource-to-fs-plugin.js:26)

Cleanup
10. readAll duplicates stream/consumers buffer() (stream.js:1)
11. Five copy-pasted originalStream.destroy() branches (request.js:62)
12. Save failures logged as "failed to request resource" (scraper.js:238)
13. outputFile is dead code (fs.js:6)

🤖 Generated with Claude Code

aivus · 2026-07-04T21:00:44Z

+						// other types need no modification - stream them to storage immediately
+						resource.setContentStream(stream);
+						self.loadResource(resource);
+						await self.persistResource(resource);


[correctness] Save failures for streamed child resources are silently swallowed

In v6, waitForLoad saved every resource via series(), so a disk-full/EACCES error while writing any image rejected scrape() when ignoreErrors: false (the default). Now non-html/css resources are persisted inside the request queue task: the error rejects requestPromise, but for child resources that promise is only awaited via Promise.allSettled in downloadChildrenResources, so the rejection is discarded. scrape() resolves successfully while the saved output is silently missing files, and the error action never fires.

Note this call also bypasses the saveResource wrapper (line 155), so there isn't even a failed to save resource warn for child resources.

aivus · 2026-07-04T21:00:56Z

-					resource.setMetadata(responseData.metadata);
-				}
+					const { stream, ...responseInfo } = responseData;
+					const { filename } = await self.runActions('generateFilename', { resource, responseData: responseInfo });


[correctness] Live response stream leaks when generateFilename throws

If a generateFilename action throws (or returns null, making this destructuring throw a TypeError), the live got stream obtained at line 176 is never consumed or destroyed — cleanup only exists inside persistResource's finally and via readAll consumption, both of which are only reached later. The paused HTTP response keeps its TCP socket open indefinitely.

Reproduced live: with ignoreErrors: true and a filename plugin that throws for one URL, the process is left with open socket handles after scrape() resolves and never exits.

Same applies to any throw between request.get resolving and stream consumption (e.g. lines 202–217). A try/catch around that region that destroys responseData.stream before rethrowing would cover all of these paths.

aivus · 2026-07-04T21:01:09Z

+			if (resource.contentStream && this.actions.saveResource.length > 1) {
+				// live stream can be consumed only once - buffer content
+				// so each saveResource action can read it
+				resource.setText(await readAll(resource.contentStream, 'binary'));


[correctness] Multiple saveResource actions re-buffer the whole body as a string, breaking the streaming guarantee

With more than one saveResource action registered (e.g. filesystem + S3 upload), this buffers the full body via Buffer.concat(chunks).toString('binary'):

For files ≥ ~512 MiB, toString throws Cannot create a string longer than 0x1fffffe8 characters, so the resource fails to save.

Below that limit, the v7 headline promise that memory no longer grows with resource size is silently broken whenever a second saveResource action exists.

The latin1 string round-trip also allocates the content twice (Buffer → string → Buffer per action) instead of buffering once as a Buffer.

Buffering as a Buffer (and exposing it via the content stream per action) would at least fix the throw and the double allocation; documenting the multi-action buffering behavior would cover the rest.

aivus · 2026-07-04T21:01:20Z

+	const originalStream = responseData.stream;
+	let bufferedBody = null;
+
+	const getBody = async () => {


[correctness] getBody memoization is not concurrency-safe

The bufferedBody === null check and the assignment straddle an await. If an afterResponse action calls getBody() twice without awaiting the first call (e.g. Promise.all([hashBody(response), inspectBody(response)]) where each helper calls getBody), both invocations start reading the same one-shot stream and split the chunks between them — each gets a corrupted/partial body, and whichever finished last is what ends up written to disk.

Memoizing the promise instead of the resolved value fixes it:

let bufferedBodyPromise = null; const getBody = () => { bufferedBodyPromise ??= readStreamToBuffer(originalStream); return bufferedBodyPromise; };

aivus · 2026-07-04T21:01:31Z

+			bodyBuffer = Buffer.from(result.body, encoding);
+		} else {
+			originalStream.destroy();
+			throwAfterResponseTypeError(result.body);


[correctness / breaking change] afterResponse returning {body: null} now throws instead of skipping the resource

In v6, getData({body: null}) returned null and transformResult treated it as "skip this resource gracefully". Now {body: null} (or {body: undefined}) falls through to this branch and throws Wrong afterResponse result. Expected null or object..., but received object — with the default ignoreErrors: false, a v6 plugin doing this on a root URL aborts the entire scrape.

Either restore the skip semantics (if (result.body === null || result.body === undefined) { originalStream.destroy(); return null; } before the type check) or document the narrowing in MIGRATION.md.

aivus · 2026-07-04T21:02:46Z

+			}
+
 			logger.info('saving resource ' + resource + ' to fs');
 			await this.runActions('saveResource', {resource});


[correctness / breaking change] saveResource actions are no longer invoked serially

v6 saved all resources one at a time via series() in waitForLoad, so saveResource actions never overlapped. Now streamed resources are persisted inside their request queue tasks, concurrently and unbounded by default (requestConcurrency: Infinity). A plugin writing to a shared, non-concurrency-safe sink (appending to a zip/tar archive, a single log file, a serial DB connection) that worked in v6 will now produce corrupted output or "connection busy" errors.

Worth documenting in MIGRATION.md ("saveResource actions may run concurrently; use requestConcurrency: 1 or make your action concurrency-safe").

aivus · 2026-07-04T21:02:58Z

@@ -0,0 +1,11 @@
+async function readAll (stream, encoding) {


[cleanup] readAll re-implements stream/consumers buffer(), less robustly

lib/request.js already imports buffer from stream/consumers (as readStreamToBuffer); this hand-rolled equivalent plus its 34-line test file could be replaced with (await buffer(stream)).toString(encoding).

The manual Buffer.concat(chunks) is also less robust: it throws TypeError: list argument must be an Array of Buffers if a plugin supplies a string-mode stream (e.g. resource.setContentStream(Readable.from('text'))) and the multi-saveResource buffering path in scraper.js runs, whereas stream/consumers buffer() handles string chunks.

aivus · 2026-07-04T21:03:10Z

-function transformResult (result) {
-	const encoding = getEncoding(result);
-	const data = getData(result);
+async function applyAfterResponse ({ responseData, headers, afterResponse }) {


[cleanup] originalStream.destroy() is repeated across five separate branches

Every early exit in this function (catch at line 84, null result at 89, type check at 94, non-buffer body at 108, body-replaced at 112) must remember its own destroy() call — any future early-return or throw added here (e.g. a new validation on result.encoding) silently leaks the got socket unless the author adds a sixth call.

A simpler shape: run the body in one try/catch that destroys and rethrows, and destroy in a single place at the end whenever the returned stream !== originalStream and the body was never read.

aivus · 2026-07-04T21:03:20Z

+				}
+				return result;
 			}).catch(function handleError (err) {
 				logger.error('failed to request resource ' + resource);


[cleanup] Save failures of streamed resources are logged as "failed to request resource"

For non-html/css resources, persistResource is awaited inside the queue task (line 228), bypassing the saveResource wrapper's failed to save resource warn at line 159 — so a disk-full or permission error thrown by a saveResource action surfaces here as failed to request resource, pointing diagnostics at the network instead of storage.

aivus · 2026-07-04T21:03:40Z

+import { createWriteStream } from 'fs';
+import { pipeline } from 'stream/promises';

 async function outputFile (file, data, encoding) {


[cleanup] outputFile is now dead code

Its only caller (save-resource-to-fs-plugin.js) was switched to outputFileStream, and lib/utils/fs.js is not part of the package exports map — no caller remains anywhere in lib/, test/, or index.mjs. Keeping it means readers must reason about two write paths (buffered vs streamed) when only one is live; it can be deleted along with its export.

aivus and others added 7 commits July 3, 2026 10:18

Add Resource content stream model and stream/fs utils (#386)

6deb6cf

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Rewrite request layer around got.stream with retry support (#386)

9585992

Resolves at headers time with the body as a Readable stream. New afterResponse contract: {url, statusCode, headers, getBody()}. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Add streaming test suites and FS plugin unit tests (#386)

74bc674

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Add memory benchmark, document v7 streaming changes, bump to 7.0.0 (#386

a3c6754

) Benchmark (100 files x 20 MB, concurrency 8): - v6.0.0: peak RSS 2274 MB - v7.0.0: peak RSS 160 MB Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Merge remote-tracking branch 'origin/master' into issue-386-streams

ace35a4

# Conflicts: # README.md

Ignore benchmark directory in qlty code coverage

420a146

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

aivus commented Jul 4, 2026

View reviewed changes

Ignore qlty for test directory

17c58a5

aivus force-pushed the issue-386-streams branch from 60d8563 to 17c58a5 Compare July 5, 2026 09:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Don't store resources content in memory - stream to storage (fixes #386)#643

Don't store resources content in memory - stream to storage (fixes #386)#643
aivus wants to merge 8 commits into
masterfrom
issue-386-streams

aivus commented Jul 3, 2026

Uh oh!

aivus left a comment

Uh oh!

aivus Jul 4, 2026

Uh oh!

aivus Jul 4, 2026

Uh oh!

aivus Jul 4, 2026

Uh oh!

aivus Jul 4, 2026

Uh oh!

aivus Jul 4, 2026

Uh oh!

aivus Jul 4, 2026

Uh oh!

aivus Jul 4, 2026

Uh oh!

aivus Jul 4, 2026

Uh oh!

aivus Jul 4, 2026

Uh oh!

aivus Jul 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Uh oh!

Conversation

aivus commented Jul 3, 2026

Benchmark

How it works

Breaking changes (v7, documented in MIGRATION.md)

Testing

Uh oh!

aivus left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant