Skip to content

fix: Constrain default sitemap loading#1956

Open
Pijukatel wants to merge 2 commits into
masterfrom
sitemaps-different-url
Open

fix: Constrain default sitemap loading#1956
Pijukatel wants to merge 2 commits into
masterfrom
sitemaps-different-url

Conversation

@Pijukatel
Copy link
Copy Markdown
Collaborator

Description

Constrain default sitemap loading.
Allow previous behavior by passing a more relaxed enqueue strategy.

Testing

  • Unit tests

Checklist

  • CI passed

tonghuaroot and others added 2 commits June 8, 2026 15:47
…ducer

The 1.7.0 fix for GHSA-3r75-xc34-5f44 wired the same-hostname enqueue
strategy into SitemapRequestLoader._passes_filters, but the lower-level
parse_sitemap / Sitemap.load / Sitemap.try_common_names API still
accepted every nested-sitemap <loc> and every <urlset><url><loc>
regardless of host. A sitemap on attacker.example could push
http://127.0.0.1:... or http://169.254.169.254/... into the queue, and
_fetch_and_process_sitemap would dispatch the request through the
configured HTTP client.

Move the filter_url check from SitemapRequestLoader._passes_filters down
into _process_sitemap_item so the same policy applies to both pipelines.
ParseSitemapOptions gains an enqueue_strategy field (default
'same-hostname', matching the loader default added in PR #1864). The
strategy is threaded through _process_raw_source and
_fetch_and_process_sitemap so producer-side filtering runs whether the
sitemap content arrived as a raw blob or via the HTTP client.

SitemapRequestLoader now stamps its configured enqueue_strategy into
ParseSitemapOptions, so its existing _passes_filters call remains
defence-in-depth rather than the sole gate. Callers that legitimately
need cross-host sitemap discovery opt in with
ParseSitemapOptions(enqueue_strategy='same-domain') / 'all'.

Note: this closes the URL-injection (read-back) path. A blind GET
against the redirect target can still occur because the HTTP-client
stream() follows 3xx with follow_redirects=True; closing that fully
needs a hook on stream() to re-run filter_url after redirect. Out of
scope for the minimal producer-side fix; tracked as a follow-up.

Signed-off-by: tonghuaroot <tonghuaroot@gmail.com>
@github-actions github-actions Bot added this to the 142nd sprint - Tooling team milestone Jun 8, 2026
@github-actions github-actions Bot added t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics. labels Jun 8, 2026
@Pijukatel Pijukatel requested a review from vdusek June 8, 2026 13:59
@Pijukatel Pijukatel changed the title chore: Constrain default sitemap loading fix: Constrain default sitemap loading Jun 8, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.00%. Comparing base (66b4b5b) to head (e6b5671).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1956      +/-   ##
==========================================
+ Coverage   92.98%   93.00%   +0.02%     
==========================================
  Files         167      167              
  Lines       11712    11727      +15     
==========================================
+ Hits        10890    10907      +17     
+ Misses        822      820       -2     
Flag Coverage Δ
unit 93.00% <100.00%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@Pijukatel Pijukatel added the adhoc Ad-hoc unplanned task added during the sprint. label Jun 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

adhoc Ad-hoc unplanned task added during the sprint. t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants