Skip to content

fix(scrapy): replace pickle with JSON serialization in request queue path (CWE-502)#945

Open
sebastiondev wants to merge 2 commits into
apify:masterfrom
sebastiondev:fix/cwe502-requests-pickle-333a
Open

fix(scrapy): replace pickle with JSON serialization in request queue path (CWE-502)#945
sebastiondev wants to merge 2 commits into
apify:masterfrom
sebastiondev:fix/cwe502-requests-pickle-333a

Conversation

@sebastiondev
Copy link
Copy Markdown

Summary

Replace pickle.loads()/pickle.dumps() with JSON-based serialization in the Scrapy request serialization path (src/apify/scrapy/requests.py). The existing code deserializes data from the Apify RequestQueue API using Python's pickle module, which allows arbitrary code execution when processing untrusted data (CWE-502: Deserialization of Untrusted Data).

Vulnerability details

Affected function: to_scrapy_request() in src/apify/scrapy/requests.py

Data flow:
Apify RequestQueue APIapify_request.user_data['scrapy_request']base64 decodepickle.loads() → arbitrary code execution

The ApifyScheduler.next_request() method in scheduler.py calls self._rq.fetch_next_request() to retrieve a request from the Apify platform API, then passes it to to_scrapy_request(). The serialized Scrapy request stored in user_data['scrapy_request'] is base64-decoded and fed directly to pickle.loads() with no validation.

Because RequestQueue data traverses the Apify cloud platform and queues can be shared between Actors, a compromised or malicious Actor with write access to a shared queue could inject a crafted pickle payload. When another Actor's Scrapy scheduler fetches and deserializes this entry, arbitrary Python code executes in that Actor's container. This enables lateral movement between Actor containers on the platform.

Proof of concept

An attacker with write access to a shared RequestQueue can inject a malicious payload:

import pickle
import codecs
import os

# Craft a payload that executes arbitrary code on deserialization
class Exploit:
    def __reduce__(self):
        return (os.system, ('echo EXPLOITED > /tmp/pwned',))

# Encode exactly as the SDK expects
malicious_payload = codecs.encode(pickle.dumps(Exploit()), 'base64').decode()

# Insert into a RequestQueue entry via the Apify API:
# PUT /v2/request-queues/{queueId}/requests
# Body: { "url": "https://example.com", "uniqueKey": "...",
#          "userData": { "scrapy_request": "<malicious_payload>" } }
#
# When another Actor fetches this request via ApifyScheduler.next_request(),
# pickle.loads() executes the payload.

After this fix, the same pickle payload is rejected with a json.JSONDecodeError or UnicodeDecodeError because the new code expects valid JSON, not pickle bytes.

What the fix does

  1. Replaces pickle.dumps() / pickle.loads() with json.dumps() / json.loads() in both to_apify_request() and to_scrapy_request().

  2. Adds _encode_for_json() and _decode_from_json() helpers that handle the types pickle serialized implicitly:

    • bytes values (e.g. request body, header values) are wrapped as {"__b64__": "<base64>"} dicts
    • bytes dict keys (Scrapy headers use them) are decoded to UTF-8 strings for JSON, then re-encoded on deserialization
    • Pydantic BaseModel instances are converted via model_dump()
    • Enum members are replaced with their .value
  3. Restores bytes header keys after JSON deserialization, since Scrapy's request_from_dict expects them.

The serialization format change is backward-incompatible for in-flight queue data serialized with the old pickle format, but this is intentional — accepting pickle payloads is the vulnerability. The old format will raise a clear error (json.JSONDecodeError) rather than silently deserializing arbitrary objects.

Tests added

Four new test cases in tests/unit/scrapy/requests/test_to_scrapy_request.py:

Test Purpose
test_pickle_payload_rejected Confirms a pickle-serialized payload raises an exception instead of being deserialized
test_roundtrip_serialization Verifies to_apify_request()to_scrapy_request() roundtrip preserves URL, method, body, and headers
test_no_pickle_in_serialized_output Asserts that to_apify_request() output is valid JSON, not pickle
test_with_reconstruction / test_with_reconstruction_with_optional_fields Existing tests updated with JSON-encoded fixtures

All existing tests in the file were updated to use JSON-encoded test fixtures instead of the old pickle-encoded constants.

Adversarial review

Before submitting, we attempted to disprove this finding. We considered whether platform-level access controls prevent exploitation: the Apify API does require a token, but having an API token does not grant direct code execution on another Actor's container — pickle deserialization provides that escalation. In the shared queue scenario (multiple Actors reading from the same RequestQueue), a compromised Actor can achieve lateral movement to other Actors' containers. We also confirmed that no input validation or sanitization exists between the API fetch and the pickle.loads() call.

Note on remaining pickle usage

src/apify/scrapy/extensions/_httpcache.py also uses pickle.dump() (line 168) and pickle.load() (line 175) for HTTP cache data stored in KeyValueStore. The same class of vulnerability applies there. This PR scopes the fix to the RequestQueue serialization path where the threat model is clearest (shared queues, cross-Actor data flow). The HTTP cache path may warrant a follow-up fix.


Submitted by Sebastion — autonomous open-source security research from Foundation Machines. Free for public repos via the Sebastion AI GitHub App.

Replace unsafe pickle.loads/dumps with JSON-based serialization when
converting Scrapy requests to/from Apify requests. The previous code
used pickle to serialize the scrapy_request dict stored in user_data,
which allowed arbitrary code execution if an attacker could inject a
crafted payload into the Apify Request Queue (e.g. via shared queue
access or API).

The new implementation:
- Serializes with json.dumps() using a recursive encoder that wraps
  bytes values as {"__b64__": "<base64>"} and converts bytes dict keys
  (used by Scrapy headers) to UTF-8 strings
- Deserializes with json.loads() plus a reverse decoder
- Handles Pydantic BaseModel instances (e.g. CrawleeRequestData) by
  calling model_dump() during encoding
- Restores bytes header keys on deserialization for Scrapy compatibility

Tests updated with JSON-encoded fixtures and new tests verifying:
- Pickle payloads are rejected
- Roundtrip serialization works correctly
- Output is never pickle-formatted
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants