Comparing changes

…ckle

Scrapy requests stored in the Apify request queue and responses stored in the HTTP cache were serialized with pickle. Those storages hold JSON, while pickle (de)serializes a Python object graph, so both paths now use a single shared JSON serializer.

Serialization:
- Add scrapy/_serialization.py: a shared JSON (de)serializer used by both the request converter and the HTTP cache. Binary fields (body and the bytes-keyed headers) are base64-encoded and pydantic models (e.g. Crawlee's UserData) are dumped to plain dicts; no in-band sentinel is used, so no user value can collide with the encoding.
- requests: (de)serialize via the shared serializer and, when reconstructing a request, only honor a `_class` that is already imported and is a scrapy.Request subclass instead of importing the dotted path.
- httpcache: store and load cached responses as gzip-compressed JSON.

Resilience and correctness:
- requests: a non-JSON-serializable meta/cb_kwargs is logged with a traceback and the request is skipped (returns None per the function's contract) instead of being silently dropped or crashing; the header conversion is guarded so a request with non-UTF-8 header values is no longer dropped (its headers are preserved in the serialized request).
- scheduler: reconstruct the request inside a try/except in next_request, so a malformed queue entry is logged and skipped instead of crashing the run.
- httpcache: treat a malformed or legacy (pickle-format) cache entry as a cache miss so it is re-fetched and re-stored; make the cleanup item cap configurable via APIFY_HTTPCACHE_EXPIRATION_MAX_ITEMS and fix its off-by-one.

Misc:
- proxy middleware: fix an f-string so the TunnelError reason is interpolated, drop a stale docstring argument, and import get_basic_auth_header from utils.
- logging: install the Scrapy configure_logging monkey-patch at most once.
- async thread: make the run_coro default timeout configurable.
- tests: regenerate the pinned fixtures for the JSON format and add coverage for binary body/headers round-trips, the sentinel-collision case, the `_class` checks, and rejection of pickle payloads.

BREAKING CHANGE: Scrapy requests and HTTP cache entries are now stored as JSON instead of pickle. Entries written by a previous version (pickle format) can no longer be read; such requests are skipped and such cache entries are treated as a miss. Values in a request's `meta` and `cb_kwargs` must be JSON-serializable.
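
For illustration only, here is a minimal sketch of the storage format described above: binary fields are base64-encoded, the entry is written as gzip-compressed JSON, and anything that does not decode cleanly is reported as a miss. The function names `encode_entry` and `decode_entry` and the payload layout are hypothetical; they are not the API of scrapy/_serialization.py, which this message does not show.

```python
from __future__ import annotations

import base64
import gzip
import json
import zlib


def _b64(raw: bytes) -> str:
    """Encode raw bytes as ASCII base64 text so they can live inside JSON."""
    return base64.b64encode(raw).decode("ascii")


def encode_entry(body: bytes, headers: dict[bytes, list[bytes]], meta: dict) -> bytes:
    """Serialize an entry to gzip-compressed JSON (hypothetical layout).

    Raises TypeError if `meta` holds a value that JSON cannot represent.
    """
    payload = {
        "body": _b64(body),
        # Header names and values are bytes, so both sides are base64-encoded.
        "headers": [[_b64(name), [_b64(v) for v in values]] for name, values in headers.items()],
        "meta": meta,
    }
    return gzip.compress(json.dumps(payload).encode("utf-8"))


def decode_entry(blob: bytes) -> dict | None:
    """Decode an entry, or return None (a miss) if it is malformed or not JSON."""
    try:
        payload = json.loads(gzip.decompress(blob).decode("utf-8"))
        return {
            "body": base64.b64decode(payload["body"]),
            "headers": {
                base64.b64decode(name): [base64.b64decode(v) for v in values]
                for name, values in payload["headers"]
            },
            "meta": payload["meta"],
        }
    except (OSError, EOFError, zlib.error, ValueError, KeyError, TypeError):
        # Not gzip, not UTF-8, not JSON, or not the expected shape:
        # treat it as a miss rather than crashing.
        return None
```

Because the bytes are wrapped in base64 rather than tagged with a marker value inside the data, a header value that is not valid UTF-8 round-trips unchanged, and a blob that is not gzip-compressed JSON (for example, one beginning with a pickle protocol header) simply yields `None`.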

Commits on Jun 7, 2026
