fix(scrapy): replace pickle with JSON serialization in request queue path (CWE-502)#945
Open
sebastiondev wants to merge 2 commits into
Open
fix(scrapy): replace pickle with JSON serialization in request queue path (CWE-502)#945sebastiondev wants to merge 2 commits into
sebastiondev wants to merge 2 commits into
Conversation
Replace unsafe pickle.loads/dumps with JSON-based serialization when
converting Scrapy requests to/from Apify requests. The previous code
used pickle to serialize the scrapy_request dict stored in user_data,
which allowed arbitrary code execution if an attacker could inject a
crafted payload into the Apify Request Queue (e.g. via shared queue
access or API).
The new implementation:
- Serializes with json.dumps() using a recursive encoder that wraps
bytes values as {"__b64__": "<base64>"} and converts bytes dict keys
(used by Scrapy headers) to UTF-8 strings
- Deserializes with json.loads() plus a reverse decoder
- Handles Pydantic BaseModel instances (e.g. CrawleeRequestData) by
calling model_dump() during encoding
- Restores bytes header keys on deserialization for Scrapy compatibility
Tests updated with JSON-encoded fixtures and new tests verifying:
- Pickle payloads are rejected
- Roundtrip serialization works correctly
- Output is never pickle-formatted
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Replace
pickle.loads()/pickle.dumps()with JSON-based serialization in the Scrapy request serialization path (src/apify/scrapy/requests.py). The existing code deserializes data from the Apify RequestQueue API using Python'spicklemodule, which allows arbitrary code execution when processing untrusted data (CWE-502: Deserialization of Untrusted Data).Vulnerability details
Affected function:
to_scrapy_request()insrc/apify/scrapy/requests.pyData flow:
Apify RequestQueue API→apify_request.user_data['scrapy_request']→base64 decode→pickle.loads()→ arbitrary code executionThe
ApifyScheduler.next_request()method inscheduler.pycallsself._rq.fetch_next_request()to retrieve a request from the Apify platform API, then passes it toto_scrapy_request(). The serialized Scrapy request stored inuser_data['scrapy_request']is base64-decoded and fed directly topickle.loads()with no validation.Because RequestQueue data traverses the Apify cloud platform and queues can be shared between Actors, a compromised or malicious Actor with write access to a shared queue could inject a crafted pickle payload. When another Actor's Scrapy scheduler fetches and deserializes this entry, arbitrary Python code executes in that Actor's container. This enables lateral movement between Actor containers on the platform.
Proof of concept
An attacker with write access to a shared RequestQueue can inject a malicious payload:
After this fix, the same pickle payload is rejected with a
json.JSONDecodeErrororUnicodeDecodeErrorbecause the new code expects valid JSON, not pickle bytes.What the fix does
Replaces
pickle.dumps()/pickle.loads()withjson.dumps()/json.loads()in bothto_apify_request()andto_scrapy_request().Adds
_encode_for_json()and_decode_from_json()helpers that handle the types pickle serialized implicitly:bytesvalues (e.g. request body, header values) are wrapped as{"__b64__": "<base64>"}dictsbytesdict keys (Scrapy headers use them) are decoded to UTF-8 strings for JSON, then re-encoded on deserializationBaseModelinstances are converted viamodel_dump().valueRestores
bytesheader keys after JSON deserialization, since Scrapy'srequest_from_dictexpects them.The serialization format change is backward-incompatible for in-flight queue data serialized with the old pickle format, but this is intentional — accepting pickle payloads is the vulnerability. The old format will raise a clear error (
json.JSONDecodeError) rather than silently deserializing arbitrary objects.Tests added
Four new test cases in
tests/unit/scrapy/requests/test_to_scrapy_request.py:test_pickle_payload_rejectedtest_roundtrip_serializationto_apify_request()→to_scrapy_request()roundtrip preserves URL, method, body, and headerstest_no_pickle_in_serialized_outputto_apify_request()output is valid JSON, not pickletest_with_reconstruction/test_with_reconstruction_with_optional_fieldsAll existing tests in the file were updated to use JSON-encoded test fixtures instead of the old pickle-encoded constants.
Adversarial review
Before submitting, we attempted to disprove this finding. We considered whether platform-level access controls prevent exploitation: the Apify API does require a token, but having an API token does not grant direct code execution on another Actor's container — pickle deserialization provides that escalation. In the shared queue scenario (multiple Actors reading from the same RequestQueue), a compromised Actor can achieve lateral movement to other Actors' containers. We also confirmed that no input validation or sanitization exists between the API fetch and the
pickle.loads()call.Note on remaining pickle usage
src/apify/scrapy/extensions/_httpcache.pyalso usespickle.dump()(line 168) andpickle.load()(line 175) for HTTP cache data stored in KeyValueStore. The same class of vulnerability applies there. This PR scopes the fix to the RequestQueue serialization path where the threat model is clearest (shared queues, cross-Actor data flow). The HTTP cache path may warrant a follow-up fix.Submitted by Sebastion — autonomous open-source security research from Foundation Machines. Free for public repos via the Sebastion AI GitHub App.