Don't store resources content in memory - stream to storage (fixes #386)#643
Don't store resources content in memory - stream to storage (fixes #386)#643aivus wants to merge 8 commits into
Conversation
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Resolves at headers time with the body as a Readable stream. New
afterResponse contract: {url, statusCode, headers, getBody()}.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Binary resources are saved immediately at response time while holding the request queue slot; html/css are buffered for link rewriting and their content is freed after save. Default FS plugin pipes streams to disk. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
# Conflicts: # README.md
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
aivus
left a comment
There was a problem hiding this comment.
High-effort automated review of the v7 streaming rework (diff master...issue-386-streams). Every finding below was independently re-verified against the code before posting (17/17 candidates confirmed, 0 refuted; deduplicated to 13 distinct issues).
Severity overview, most severe first:
Correctness — streaming lifecycle
- Save failures of streamed child resources are silently swallowed (
scraper.js:228) - Live got stream leaks when
generateFilenamethrows (scraper.js:205) — reproduced live - Multiple
saveResourceactions re-buffer the whole body as a string, throwing at ~512 MiB (scraper.js:135) getBodymemoization race corrupts content under concurrent calls (request.js:66)
Correctness — undocumented v6→v7 behavior changes
5. afterResponse {body: null} now throws instead of skipping (request.js:109)
6. Default string-body encoding changed from binary to header-derived (request.js:98)
7. saveResource actions no longer run serially (scraper.js:141)
Correctness — FS plugin cleanup
8. Mid-stream failure leaves the output directory behind, breaking retries (save-resource-to-fs-plugin.js:24)
9. Cleanup rm can mask the original save error (save-resource-to-fs-plugin.js:26)
Cleanup
10. readAll duplicates stream/consumers buffer() (stream.js:1)
11. Five copy-pasted originalStream.destroy() branches (request.js:62)
12. Save failures logged as "failed to request resource" (scraper.js:238)
13. outputFile is dead code (fs.js:6)
🤖 Generated with Claude Code
| // other types need no modification - stream them to storage immediately | ||
| resource.setContentStream(stream); | ||
| self.loadResource(resource); | ||
| await self.persistResource(resource); |
There was a problem hiding this comment.
[correctness] Save failures for streamed child resources are silently swallowed
In v6, waitForLoad saved every resource via series(), so a disk-full/EACCES error while writing any image rejected scrape() when ignoreErrors: false (the default). Now non-html/css resources are persisted inside the request queue task: the error rejects requestPromise, but for child resources that promise is only awaited via Promise.allSettled in downloadChildrenResources, so the rejection is discarded. scrape() resolves successfully while the saved output is silently missing files, and the error action never fires.
Note this call also bypasses the saveResource wrapper (line 155), so there isn't even a failed to save resource warn for child resources.
| resource.setMetadata(responseData.metadata); | ||
| } | ||
| const { stream, ...responseInfo } = responseData; | ||
| const { filename } = await self.runActions('generateFilename', { resource, responseData: responseInfo }); |
There was a problem hiding this comment.
[correctness] Live response stream leaks when generateFilename throws
If a generateFilename action throws (or returns null, making this destructuring throw a TypeError), the live got stream obtained at line 176 is never consumed or destroyed — cleanup only exists inside persistResource's finally and via readAll consumption, both of which are only reached later. The paused HTTP response keeps its TCP socket open indefinitely.
Reproduced live: with ignoreErrors: true and a filename plugin that throws for one URL, the process is left with open socket handles after scrape() resolves and never exits.
Same applies to any throw between request.get resolving and stream consumption (e.g. lines 202–217). A try/catch around that region that destroys responseData.stream before rethrowing would cover all of these paths.
| if (resource.contentStream && this.actions.saveResource.length > 1) { | ||
| // live stream can be consumed only once - buffer content | ||
| // so each saveResource action can read it | ||
| resource.setText(await readAll(resource.contentStream, 'binary')); |
There was a problem hiding this comment.
[correctness] Multiple saveResource actions re-buffer the whole body as a string, breaking the streaming guarantee
With more than one saveResource action registered (e.g. filesystem + S3 upload), this buffers the full body via Buffer.concat(chunks).toString('binary'):
- For files ≥ ~512 MiB,
toStringthrowsCannot create a string longer than 0x1fffffe8 characters, so the resource fails to save. - Below that limit, the v7 headline promise that memory no longer grows with resource size is silently broken whenever a second
saveResourceaction exists. - The latin1 string round-trip also allocates the content twice (Buffer → string → Buffer per action) instead of buffering once as a Buffer.
Buffering as a Buffer (and exposing it via the content stream per action) would at least fix the throw and the double allocation; documenting the multi-action buffering behavior would cover the rest.
| const originalStream = responseData.stream; | ||
| let bufferedBody = null; | ||
|
|
||
| const getBody = async () => { |
There was a problem hiding this comment.
[correctness] getBody memoization is not concurrency-safe
The bufferedBody === null check and the assignment straddle an await. If an afterResponse action calls getBody() twice without awaiting the first call (e.g. Promise.all([hashBody(response), inspectBody(response)]) where each helper calls getBody), both invocations start reading the same one-shot stream and split the chunks between them — each gets a corrupted/partial body, and whichever finished last is what ends up written to disk.
Memoizing the promise instead of the resolved value fixes it:
let bufferedBodyPromise = null;
const getBody = () => {
bufferedBodyPromise ??= readStreamToBuffer(originalStream);
return bufferedBodyPromise;
};| bodyBuffer = Buffer.from(result.body, encoding); | ||
| } else { | ||
| originalStream.destroy(); | ||
| throwAfterResponseTypeError(result.body); |
There was a problem hiding this comment.
[correctness / breaking change] afterResponse returning {body: null} now throws instead of skipping the resource
In v6, getData({body: null}) returned null and transformResult treated it as "skip this resource gracefully". Now {body: null} (or {body: undefined}) falls through to this branch and throws Wrong afterResponse result. Expected null or object..., but received object — with the default ignoreErrors: false, a v6 plugin doing this on a root URL aborts the entire scrape.
Either restore the skip semantics (if (result.body === null || result.body === undefined) { originalStream.destroy(); return null; } before the type check) or document the narrowing in MIGRATION.md.
| } | ||
|
|
||
| logger.info('saving resource ' + resource + ' to fs'); | ||
| await this.runActions('saveResource', {resource}); |
There was a problem hiding this comment.
[correctness / breaking change] saveResource actions are no longer invoked serially
v6 saved all resources one at a time via series() in waitForLoad, so saveResource actions never overlapped. Now streamed resources are persisted inside their request queue tasks, concurrently and unbounded by default (requestConcurrency: Infinity). A plugin writing to a shared, non-concurrency-safe sink (appending to a zip/tar archive, a single log file, a serial DB connection) that worked in v6 will now produce corrupted output or "connection busy" errors.
Worth documenting in MIGRATION.md ("saveResource actions may run concurrently; use requestConcurrency: 1 or make your action concurrency-safe").
| @@ -0,0 +1,11 @@ | |||
| async function readAll (stream, encoding) { | |||
There was a problem hiding this comment.
[cleanup] readAll re-implements stream/consumers buffer(), less robustly
lib/request.js already imports buffer from stream/consumers (as readStreamToBuffer); this hand-rolled equivalent plus its 34-line test file could be replaced with (await buffer(stream)).toString(encoding).
The manual Buffer.concat(chunks) is also less robust: it throws TypeError: list argument must be an Array of Buffers if a plugin supplies a string-mode stream (e.g. resource.setContentStream(Readable.from('text'))) and the multi-saveResource buffering path in scraper.js runs, whereas stream/consumers buffer() handles string chunks.
| function transformResult (result) { | ||
| const encoding = getEncoding(result); | ||
| const data = getData(result); | ||
| async function applyAfterResponse ({ responseData, headers, afterResponse }) { |
There was a problem hiding this comment.
[cleanup] originalStream.destroy() is repeated across five separate branches
Every early exit in this function (catch at line 84, null result at 89, type check at 94, non-buffer body at 108, body-replaced at 112) must remember its own destroy() call — any future early-return or throw added here (e.g. a new validation on result.encoding) silently leaks the got socket unless the author adds a sixth call.
A simpler shape: run the body in one try/catch that destroys and rethrows, and destroy in a single place at the end whenever the returned stream !== originalStream and the body was never read.
| } | ||
| return result; | ||
| }).catch(function handleError (err) { | ||
| logger.error('failed to request resource ' + resource); |
There was a problem hiding this comment.
[cleanup] Save failures of streamed resources are logged as "failed to request resource"
For non-html/css resources, persistResource is awaited inside the queue task (line 228), bypassing the saveResource wrapper's failed to save resource warn at line 159 — so a disk-full or permission error thrown by a saveResource action surfaces here as failed to request resource, pointing diagnostics at the network instead of storage.
| import { createWriteStream } from 'fs'; | ||
| import { pipeline } from 'stream/promises'; | ||
|
|
||
| async function outputFile (file, data, encoding) { |
There was a problem hiding this comment.
[cleanup] outputFile is now dead code
Its only caller (save-resource-to-fs-plugin.js) was switched to outputFileStream, and lib/utils/fs.js is not part of the package exports map — no caller remains anywhere in lib/, test/, or index.mjs. Keeping it means readers must reason about two write paths (buffered vs streamed) when only one is live; it can be deleted along with its export.
Fixes #386
Reworks the scraper around streams so resource content is no longer stored in memory. Memory usage is now bounded by
requestConcurrencyinstead of total website size.Benchmark
npm run benchmark:memory(new script) — local server, 100 files × 20 MB (2 GB total), concurrency 8:How it works
lib/request.jsusesgot.streamand resolves at response-headers time with{url, statusCode, mimeType, encoding, metadata, stream}. Aretry-event wrapper keeps got's transient-error retries working for streams (up to headers time; a connection dying mid-body can't be retried with streams).lib/scraper.js: resources which need no modification (images, fonts, media, js — the bulk of bytes) are piped straight to storage at response time, while still holding the request-queue slot sorequestConcurrencykeeps bounding concurrent transfers. HTML/CSS still need full buffering for link rewriting (cheerio re-serializes the whole document), but their text is freed immediately after save.getReferenceonly needs the child's filename, and filenames are derivable from response headers (mime → type → generateFilename) before the body is consumed.requestConcurrency: 1; covered by a regression test).resource.getContentStream()to disk and removes partially written files on mid-stream failure, soignoreErrors: truenever leaves partial garbage.saveResourceactions on a streamed resource, content is buffered once so every storage can read it (a tee with in-series actions would deadlock on backpressure). Still strictly better than v6, which always buffered everything.Breaking changes (v7, documented in MIGRATION.md)
saveResourceactions:resource.getText()→resource.getContentStream()(single-use for streamed resources;await buffer(...)fromnode:stream/consumersfor whole-content storages)afterResponsereceives{response: {url, statusCode, headers, getBody()}}instead of the full got response. String returns removed; returning an object withoutbodykeeps the resource streaming (zero buffering)scrape()result no longer carries content —getText()isnullafter save (also inonResourceSaved)generateFilename'sresponseDatano longer containsbodyresponseTypein therequestoption is ignoredwebsite-scraper-puppeteer / website-scraper-phantom will need mechanical updates for the new
afterResponsecontract (response.body→await response.getBody()).Testing
npm test), eslint cleantest/functional/streamingsuite: byte-fidelity of multi-MB binaries, content freed after save, bothafterResponsepaths, multiplesaveResourcestorages, unconsumed streams, mid-stream failures (via a real local HTTP server — nock can't simulate a connection dying mid-body), redirect-to-already-requested-URL atrequestConcurrency: 1test/unit/plugins.test.jsforSaveResourceToFileSystemPluginnpm run test-e2e): 20+ real sites pass; the only failures areantipenko.pp.ua, which is currently stuck in a server-side 301 self-redirect loop (fails on master too)🤖 Generated with Claude Code