Comparing changes

base repository: apify/apify-sdk-python
base: master
head repository: apify/apify-sdk-python
compare: update-scrapy
  • 1 commit
  • 10 files changed
  • 1 contributor

Commits on Jun 7, 2026

  1. fix(scrapy)!: Serialize requests and HTTP cache as JSON instead of pickle
    
    Scrapy requests stored in the Apify request queue and responses stored in the
    HTTP cache were serialized with pickle. Those storages hold JSON, while pickle
    (de)serializes a Python object graph, so both paths now use a single shared
    JSON serializer.
    
    Serialization:
    - Add scrapy/_serialization.py: a shared JSON (de)serializer used by both the
      request converter and the HTTP cache. Binary fields (body and the
      bytes-keyed headers) are base64-encoded and pydantic models (e.g. Crawlee's
      UserData) are dumped to plain dicts; no in-band sentinel is used, so no user
      value can collide with the encoding.
    - requests: (de)serialize via the shared serializer and, when reconstructing a
      request, only honor a `_class` that is already imported and is a
      scrapy.Request subclass instead of importing the dotted path.
    - httpcache: store and load cached responses as gzip-compressed JSON.
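
The serialization bullets above can be illustrated with a minimal sketch. It is not the contents of scrapy/_serialization.py: the function names (`encode_entry`, `decode_entry`, `pack`, `unpack`) and the field layout are assumptions made for illustration, and pydantic handling is left out. It shows one way to base64-encode binary fields without an in-band sentinel (the schema decides which fields are binary, so nothing inside `meta` is ever interpreted as encoded bytes) and how the result can be gzip-compressed.

```python
import base64
import gzip
import json


def _b64(data: bytes) -> str:
    return base64.b64encode(data).decode("ascii")


def _unb64(text: str) -> bytes:
    return base64.b64decode(text)


def encode_entry(url: str, body: bytes, headers: dict, meta: dict) -> str:
    """Serialize one entry; `headers` maps bytes keys to lists of bytes values."""
    # Binary fields are identified by their position in this fixed layout,
    # not by a marker stored inside the value.
    return json.dumps({
        "url": url,
        "body": _b64(body),
        "headers": [[_b64(k), [_b64(v) for v in vs]] for k, vs in headers.items()],
        "meta": meta,  # must already be JSON-serializable
    })


def decode_entry(text: str) -> dict:
    doc = json.loads(text)
    return {
        "url": doc["url"],
        "body": _unb64(doc["body"]),
        "headers": {_unb64(k): [_unb64(v) for v in vs] for k, vs in doc["headers"]},
        "meta": doc["meta"],
    }


def pack(text: str) -> bytes:
    """Gzip-compress the JSON text, as an HTTP cache entry might be stored."""
    return gzip.compress(text.encode("utf-8"))


def unpack(blob: bytes) -> str:
    return gzip.decompress(blob).decode("utf-8")
```

Because only the `body` and `headers` positions are decoded, a user-supplied `meta` value such as `{"__bytes__": "aGk="}` comes back unchanged, and header values that are not valid UTF-8 survive the round trip.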
    
    Resilience and correctness:
    - requests: a non-JSON-serializable meta/cb_kwargs is logged with a traceback
      and the request is skipped (returns None per the function's contract)
      instead of being silently dropped or crashing. The header conversion is
      guarded, so a request with non-UTF-8 header values is no longer dropped;
      its headers are preserved in the serialized request.
    - scheduler: reconstruct the request inside a try/except in next_request, so a
      malformed queue entry is logged and skipped instead of crashing the run.
    - httpcache: treat a malformed or legacy (pickle-format) cache entry as a
      cache miss so it is re-fetched and re-stored; make the cleanup item cap
      configurable via APIFY_HTTPCACHE_EXPIRATION_MAX_ITEMS and fix its
      off-by-one.
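
The "malformed or legacy entry is a cache miss" rule above can be sketched as follows. The function name `load_cache_entry` and the assumption that a valid entry is a JSON object are hypothetical; the point is only that every decoding failure maps to `None` (a miss) and that the stored bytes are never handed to pickle.

```python
import gzip
import json
import zlib
from typing import Optional


def load_cache_entry(blob: bytes) -> Optional[dict]:
    """Return the decoded entry, or None so the caller re-fetches and re-stores."""
    try:
        doc = json.loads(gzip.decompress(blob).decode("utf-8"))
    except (OSError, EOFError, zlib.error, ValueError):
        # OSError covers "not a gzip file", EOFError covers truncated data,
        # ValueError covers both invalid UTF-8 and invalid JSON.
        return None
    # Anything other than a JSON object is also treated as a miss.
    return doc if isinstance(doc, dict) else None
```

Under these assumptions a legacy pickle payload fails either at the gzip step (raw pickle bytes) or at the UTF-8/JSON step (gzip-compressed pickle bytes), so it is simply re-fetched.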
    
    Misc:
    - proxy middleware: fix an f-string so the TunnelError reason is interpolated,
      drop a stale docstring argument, and import get_basic_auth_header from utils.
    - logging: install the Scrapy configure_logging monkey-patch at most once.
    - async thread: make the run_coro default timeout configurable.
    - tests: regenerate the pinned fixtures for the JSON format and add coverage
      for binary body/headers round-trips, the sentinel-collision case, the
      `_class` checks, and rejection of pickle payloads.
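
The `_class` checks covered by the tests above guard the rule from the Serialization section: a stored class path is honored only if the class is already imported and is a subclass of the expected base. A minimal sketch of that rule, assuming a hypothetical helper `resolve_class` and using standard-library classes in place of scrapy.Request so the example runs without Scrapy:

```python
import collections  # used only by the demonstration below
import sys


def resolve_class(dotted_path: str, base: type) -> type:
    """Return the class named by `dotted_path`, or `base` if it is not acceptable."""
    module_name, _, class_name = dotted_path.rpartition(".")
    # Look the module up in sys.modules only; nothing is imported here, so a
    # stored path cannot trigger import-time side effects.
    module = sys.modules.get(module_name)
    candidate = getattr(module, class_name, None) if module is not None else None
    if isinstance(candidate, type) and issubclass(candidate, base):
        return candidate
    return base


# Already imported and a subclass of the base: honored.
ordered = resolve_class("collections.OrderedDict", dict)
# Already imported but not a subclass of the base: falls back to the base.
fallback = resolve_class("collections.deque", dict)
```

In the Scrapy case the base would be scrapy.Request, so an entry naming an unknown or unrelated class would be rebuilt as a plain request rather than causing an import.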
    
    BREAKING CHANGE: Scrapy requests and HTTP cache entries are now stored as JSON
    instead of pickle. Entries written by a previous version (pickle format) can no
    longer be read; such requests are skipped and such cache entries are treated as
    a miss. Values in a request's `meta` and `cb_kwargs` must be JSON-serializable.
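
For spiders affected by the JSON-serializability requirement above, a quick pre-flight check can help find offending values. This is a sketch using only the standard library; `is_json_serializable` is a hypothetical helper, not part of the SDK, and passing it does not guarantee an identical round trip (for example, tuples come back from JSON as lists).

```python
import json
from datetime import datetime, timezone


def is_json_serializable(value: object) -> bool:
    """Return True if the standard json module can encode `value`."""
    try:
        json.dumps(value)
    except (TypeError, ValueError):
        # TypeError: unsupported type; ValueError: e.g. circular reference.
        return False
    return True


scraped_at = datetime(2026, 6, 7, tzinfo=timezone.utc)

# datetime and set values are not JSON-serializable.
bad_meta = {"scraped_at": scraped_at, "seen_ids": {1, 2}}

# Converting them to strings and lists before building the request avoids the problem.
good_meta = {"scraped_at": scraped_at.isoformat(), "seen_ids": sorted({1, 2})}
```

Here `is_json_serializable(bad_meta)` is False and `is_json_serializable(good_meta)` is True.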
    vdusek committed Jun 7, 2026
    a2f7a32