## Summary
Port the `discoverValidSitemaps()` utility from Crawlee JS to Python.
JS source: `packages/utils/src/internals/sitemap.ts` (#3392)
## How it works in JS

```ts
async function* discoverValidSitemaps(
    urls: string[],
    options?: { proxyUrl?: string; httpClient?: BaseHttpClient },
): AsyncIterable<string>
```
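A possible Python-side signature for the port could look like the following. Everything here (name, parameter names, the `Any` stand-in for the HTTP client type) is an assumption for illustration, not a decided API:

```python
from collections.abc import AsyncIterator
from typing import Any

async def discover_valid_sitemaps(
    urls: list[str],
    *,
    proxy_url: str | None = None,
    http_client: Any = None,  # stand-in for Crawlee's HTTP client abstraction
) -> AsyncIterator[str]:
    """Yield sitemap URLs for the given page URLs as they are discovered."""
    raise NotImplementedError
    yield  # unreachable, but marks this function as an async generator
```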
- Group the input URLs by hostname.
- For each domain, discover sitemaps from (in order):
  - `Sitemap:` entries in robots.txt
  - input URLs that match `/sitemap\.(xml|txt)(\.gz)?$/i`
  - HEAD-request probing of `/sitemap.xml`, `/sitemap.txt`, and `/sitemap_index.xml` (fallback)
- Deduplicate results and process domains concurrently.

The function returns an async iterable that yields sitemap URLs as they are discovered.
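A minimal Python sketch of the per-domain discovery order, assuming `RobotsTxtFile` lives in `crawlee._utils.robots` and that `find()` takes a single URL (both assumptions), and using `httpx` directly for the HEAD probe where the real port would go through Crawlee's HTTP client:

```python
import re

import httpx

from crawlee._utils.robots import RobotsTxtFile  # module path assumed

SITEMAP_URL_RE = re.compile(r'/sitemap\.(xml|txt)(\.gz)?$', re.IGNORECASE)
COMMON_PATHS = ('/sitemap.xml', '/sitemap.txt', '/sitemap_index.xml')

async def discover_for_domain(base_url: str, input_urls: list[str]) -> list[str]:
    """Apply the JS discovery order to a single domain."""
    found: list[str] = []

    # 1. Sitemap: entries from robots.txt.
    robots = await RobotsTxtFile.find(base_url)  # signature assumed
    found.extend(robots.get_sitemaps())

    # 2. Input URLs that already look like sitemap URLs.
    found.extend(url for url in input_urls if SITEMAP_URL_RE.search(url))

    # 3. Fallback: HEAD-probe the common sitemap locations.
    if not found:
        async with httpx.AsyncClient() as client:
            for path in COMMON_PATHS:
                candidate = base_url.rstrip('/') + path
                response = await client.head(candidate, follow_redirects=True)
                if response.status_code == 200:
                    found.append(candidate)

    # Deduplicate while preserving discovery order.
    return list(dict.fromkeys(found))
```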
## What Python already has
- `Sitemap.try_common_names()` — probes `/sitemap.xml` and `/sitemap.txt` for a single URL (missing `/sitemap_index.xml`)
- `RobotsTxtFile.find()` + `get_sitemaps()` — fetches and extracts `Sitemap:` entries from robots.txt
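For reference, a usage sketch of the existing probe helper; the module path and exact signature are assumed from the names above rather than verified:

```python
from crawlee._utils.sitemap import Sitemap  # module path assumed

async def probe(url: str) -> None:
    # Derives /sitemap.xml and /sitemap.txt from the given URL and tries both;
    # /sitemap_index.xml is not attempted yet, as noted above.
    sitemap = await Sitemap.try_common_names(url)  # signature assumed
    print(sitemap.urls)  # assuming the result exposes the collected URLs
```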
## What's missing
The orchestrating function that ties these steps together: grouping input URLs by hostname, detecting direct sitemap URLs in the input, validating candidates via HEAD requests, and processing domains concurrently.
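A hedged sketch of that orchestration, reusing the hypothetical `discover_for_domain()` from the earlier snippet (proxy and HTTP-client options omitted for brevity). The concurrency strategy, one task per domain with results yielded as each finishes, mirrors the JS description but is otherwise an assumption:

```python
import asyncio
from collections.abc import AsyncIterator
from urllib.parse import urlparse

async def discover_valid_sitemaps(urls: list[str]) -> AsyncIterator[str]:
    # Group input URLs by hostname so each domain is processed once.
    by_host: dict[str, list[str]] = {}
    for url in urls:
        host = urlparse(url).hostname
        if host:
            by_host.setdefault(host, []).append(url)

    # One discovery task per domain, run concurrently.
    tasks = []
    for host, host_urls in by_host.items():
        scheme = urlparse(host_urls[0]).scheme or 'https'
        tasks.append(asyncio.ensure_future(discover_for_domain(f'{scheme}://{host}', host_urls)))

    # Yield sitemap URLs as each domain finishes, deduplicating globally.
    seen: set[str] = set()
    for finished in asyncio.as_completed(tasks):
        for sitemap_url in await finished:
            if sitemap_url not in seen:
                seen.add(sitemap_url)
                yield sitemap_url
```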