fix: Constrain default sitemap loading #1956
Open
Pijukatel wants to merge 2 commits into
…ducer

The 1.7.0 fix for GHSA-3r75-xc34-5f44 wired the same-hostname enqueue strategy into SitemapRequestLoader._passes_filters, but the lower-level parse_sitemap / Sitemap.load / Sitemap.try_common_names API still accepted every nested-sitemap <loc> and every <urlset><url><loc> regardless of host. A sitemap on attacker.example could push http://127.0.0.1:... or http://169.254.169.254/... into the queue, and _fetch_and_process_sitemap would dispatch the request through the configured HTTP client.

Move the filter_url check from SitemapRequestLoader._passes_filters down into _process_sitemap_item so the same policy applies to both pipelines. ParseSitemapOptions gains an enqueue_strategy field (default 'same-hostname', matching the loader default added in PR #1864). The strategy is threaded through _process_raw_source and _fetch_and_process_sitemap so producer-side filtering runs whether the sitemap content arrived as a raw blob or via the HTTP client.

SitemapRequestLoader now stamps its configured enqueue_strategy into ParseSitemapOptions, so its existing _passes_filters call remains defence-in-depth rather than the sole gate. Callers that legitimately need cross-host sitemap discovery opt in with ParseSitemapOptions(enqueue_strategy='same-domain') or ParseSitemapOptions(enqueue_strategy='all').

Note: this closes the URL-injection (read-back) path. A blind GET against the redirect target can still occur because the HTTP-client stream() follows 3xx with follow_redirects=True; closing that fully needs a hook on stream() to re-run filter_url after redirect. Out of scope for the minimal producer-side fix; tracked as a follow-up.

Signed-off-by: tonghuaroot <tonghuaroot@gmail.com>
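For reviewers who want a concrete picture of the producer-side check described in the commit message, here is a minimal standalone sketch. It is not the code in this PR: `passes_strategy` and `filter_sitemap_locs` are hypothetical names, the real check lives in `filter_url` / `_process_sitemap_item`, and only the `'same-hostname'` and `'all'` strategies are modelled (`'same-domain'` needs registrable-domain handling, which is omitted here).

```python
from urllib.parse import urlsplit


def passes_strategy(strategy: str, origin_url: str, target_url: str) -> bool:
    """Hypothetical stand-in for the library's URL filter (illustration only)."""
    if strategy == 'all':
        # No host restriction: every <loc> is accepted.
        return True
    if strategy == 'same-hostname':
        # Compare hostnames only; urlsplit lowercases them and drops the port.
        origin_host = urlsplit(origin_url).hostname
        target_host = urlsplit(target_url).hostname
        return origin_host is not None and origin_host == target_host
    raise ValueError(f'Strategy not covered by this sketch: {strategy!r}')


def filter_sitemap_locs(
    sitemap_url: str,
    locs: list[str],
    strategy: str = 'same-hostname',
) -> list[str]:
    """Keep only the <loc> values that the strategy allows for this sitemap."""
    return [loc for loc in locs if passes_strategy(strategy, sitemap_url, loc)]


# A sitemap served from attacker.example that lists internal targets.
locs = [
    'https://attacker.example/page',
    'http://127.0.0.1/admin',
    'http://169.254.169.254/latest/meta-data/',
]
kept_default = filter_sitemap_locs('https://attacker.example/sitemap.xml', locs)
kept_all = filter_sitemap_locs('https://attacker.example/sitemap.xml', locs, 'all')
```

With the default strategy only the `attacker.example` URL survives, while `'all'` reproduces the previous accept-everything behavior. Note that this kind of check only covers URLs listed in the sitemap; as the commit message says, redirects followed by the HTTP client are a separate path.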
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## master #1956 +/- ##
==========================================
+ Coverage 92.98% 93.00% +0.02%
==========================================
Files 167 167
Lines 11712 11727 +15
==========================================
+ Hits 10890 10907 +17
+ Misses 822 820 -2
Description
Constrain default sitemap loading.
Allow previous behavior by passing a more relaxed enqueue strategy.
Testing
Checklist